# Data Cleaning 

#### 1. Import pandas library.

In [1]:
import pandas as pd

#### 2. Import pymysql and sqlalchemy, as you learned in the lesson on importing/exporting data.


In [2]:
import pymysql 
from sqlalchemy import create_engine

#### 3. Create a MySQL engine to connect to the server. Check the connection details in [this link](https://relational.fit.cvut.cz/dataset/Stats).

In [3]:
# Create engine  

# hostname: relational.fit.cvut.cz
# port: 3306
# username: guest
# password: relational
engine = create_engine('mysql+pymysql://guest:relational@relational.fit.cvut.cz:3306/stats') 


#### 4. Import the users table.

In [4]:
sql_query = 'SELECT * FROM users;' 
users_raw = pd.read_sql_query(sql_query, engine)

#### 5. Rename Id column to userId.

In [5]:
# Rename Id column 
users = users_raw.rename(columns={'Id':'userId'}) 

#### 6. Import the posts table. 

In [6]:
sql = 'SELECT * FROM posts;' 
posts_raw = pd.read_sql_query(sql,engine)

#### 7. Rename Id column to postId and OwnerUserId to userId.

In [7]:
posts = posts_raw.rename(columns={'Id': 'postId','OwnerUserId':'userId'}) 

#### 8. Define new dataframes for users and posts with the following selected columns:
**users columns**: userId, Reputation, Views, UpVotes, DownVotes  
**posts columns**: postId, Score, userId, ViewCount, CommentCount

In [8]:
users_sub = users[['userId','Reputation', 'Views', 'UpVotes', 'DownVotes']] 
posts_sub = posts[['postId', 'Score', 'userId', 'ViewCount', 'CommentCount']]


#### 9. Merge the new users and posts dataframes you have created.
You will need to do an inner [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) of the posts and users dataframes.

In [9]:
users_posts = pd.merge(posts_sub, users_sub, how='inner', on='userId') 
users_posts.head()

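|   | postId | Score | userId | ViewCount | CommentCount | Reputation | Views | UpVotes | DownVotes |
|---|--------|-------|--------|-----------|--------------|------------|-------|---------|-----------|
| 0 | 1      | 23    | 8.0    | 1278.0    | 1            | 6764       | 1089  | 604     | 25        |
| 1 | 16     | 16    | 8.0    | NaN       | 3            | 6764       | 1089  | 604     | 25        |
| 2 | 36     | 41    | 8.0    | 67396.0   | 7            | 6764       | 1089  | 604     | 25        |
| 3 | 65     | 14    | 8.0    | NaN       | 3            | 6764       | 1089  | 604     | 25        |
| 4 | 78     | 33    | 8.0    | NaN       | 4            | 6764       | 1089  | 604     | 25        |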
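
To make the effect of the inner merge easier to see, here is a minimal sketch on toy data. The frames `toy_posts` and `toy_users` are hypothetical and not taken from the Stats database. The sketch assumes that some posts may have no matching user, and uses an outer merge with `indicator=True` to list the posts that an inner merge would silently drop.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Stats database
toy_posts = pd.DataFrame({
    'postId': [1, 2, 3, 4],
    'userId': [8.0, 8.0, 99.0, None],  # 99 has no user row; None is a missing owner
    'Score': [23, 16, 5, 2],
})
toy_users = pd.DataFrame({
    'userId': [8.0, 10.0],
    'Reputation': [6764, 101],
})

# Inner merge keeps only posts whose userId exists in toy_users
inner = pd.merge(toy_posts, toy_users, how='inner', on='userId')

# Outer merge with an indicator column shows where each row came from
outer = pd.merge(toy_posts, toy_users, how='outer', on='userId', indicator=True)
lost_posts = outer[outer['_merge'] == 'left_only']

print(inner)
print(lost_posts[['postId', 'userId']])
```

Comparing `len(posts_sub)` with `len(users_posts)` on the real data would be a quick way to check whether any posts were dropped by the merge.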


#### 10. How many missing values do you have in your merged dataframe? In which columns?

In [10]:
total_missing = users_posts.isnull().sum().sum()
print(f"In total there are {total_missing} missing values in the merged dataframe")

null_cols = users_posts.isnull().sum()
null_cols[null_cols > 0]


In total there are 48396 missing values in the merged dataframe


ViewCount    48396
dtype: int64
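
The raw count is easier to judge next to the share of rows it represents. The following is a small sketch on a toy frame (`toy` is a hypothetical stand-in for `users_posts`, reusing a few values from the output shown earlier) that reports both the number and the fraction of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for users_posts
toy = pd.DataFrame({
    'postId': [1, 16, 36, 65],
    'ViewCount': [1278.0, np.nan, 67396.0, np.nan],
    'CommentCount': [1, 3, 7, 3],
})

missing_count = toy.isnull().sum()   # number of nulls per column
missing_share = toy.isnull().mean()  # fraction of nulls per column

summary = pd.DataFrame({'missing': missing_count, 'share': missing_share})
summary = summary[summary['missing'] > 0]  # keep only columns with nulls
print(summary)
```

On the real dataframe, 48396 missing values out of 90584 rows means that a little over half of the `ViewCount` entries are null.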

#### 11. You will need to do something about the missing values. Will you drop them or fill them? Explain.
**Remember** to check the results of your code before going to the next step.

In [12]:
# First, evaluate the original posts a little more to understand ViewCount better

posts[posts['ViewCount'].isna()]

# It looks like missing ViewCount values only occur in posts with PostTypeId != 1.
# Let's confirm this: no post with a missing ViewCount should have PostTypeId == 1.
null_views = posts[posts['ViewCount'].isna()]
post_type = null_views[null_views['PostTypeId'] == 1]
print(post_type)

# The missing values are related to PostTypeId. For now I will drop the rows with
# null values, as there is no easy way to estimate a meaningful ViewCount for these posts.
# I will investigate further what the different PostTypeIds represent and
# then make changes if required.
print("Original dataframe", users_posts.shape)
users_posts = users_posts.dropna()
print("New dataframe with dropped values", users_posts.shape)

Empty DataFrame
Columns: [postId, PostTypeId, AcceptedAnswerId, CreaionDate, Score, ViewCount, Body, userId, LasActivityDate, Title, Tags, AnswerCount, CommentCount, FavoriteCount, LastEditorUserId, LastEditDate, CommunityOwnedDate, ParentId, ClosedDate, OwnerDisplayName, LastEditorDisplayName]
Index: []

[0 rows x 21 columns]
Original dataframe (90584, 9)
New dataframe with dropped values (42188, 9)
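
The same check can be written as a per-type summary, which also shows whether every post of a given type is affected. This is a sketch on hypothetical data (`toy_posts` and its values are made up); it also uses `dropna(subset=...)` so that only nulls in `ViewCount` trigger a drop, which is an assumption about the intent rather than what the cell above does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same column names as the posts table
toy_posts = pd.DataFrame({
    'postId': [1, 2, 3, 4, 5],
    'PostTypeId': [1, 2, 1, 2, 2],
    'ViewCount': [1278.0, np.nan, 67396.0, np.nan, np.nan],
})

# Fraction of posts with a missing ViewCount, per PostTypeId
missing_by_type = toy_posts['ViewCount'].isna().groupby(toy_posts['PostTypeId']).mean()
print(missing_by_type)

# Drop rows only when ViewCount itself is null
cleaned = toy_posts.dropna(subset=['ViewCount'])
print("Original dataframe", toy_posts.shape)
print("New dataframe with dropped values", cleaned.shape)
```

A share of 0.0 for one type and 1.0 for another would suggest that the value is structurally absent for that type rather than randomly missing, which supports dropping over filling.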


#### 12. Adjust the data types in order to avoid future issues. Which ones should be changed? 
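
One possible approach, sketched on toy data: in the `head()` output from step 9, `userId` and `ViewCount` are displayed as floats (`8.0`, `1278.0`), most likely because the columns contained nulls at some point. Once no nulls remain, they could be cast to integers. The frame `toy` is hypothetical, and which columns actually need conversion in the real dataframe should be confirmed with `users_posts.dtypes`.

```python
import pandas as pd

# Hypothetical stand-in for users_posts after the null rows were dropped
toy = pd.DataFrame({
    'postId': [1, 36],
    'userId': [8.0, 8.0],
    'ViewCount': [1278.0, 67396.0],
})
print(toy.dtypes)

# Cast float columns that hold whole numbers to integers
converted = toy.astype({'userId': 'int64', 'ViewCount': 'int64'})
print(converted.dtypes)
```

Note that `astype('int64')` raises an error if the column still contains nulls, so the cast belongs after the missing values have been handled.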