# Data Science Essentials: Data Cleaning
    <Name>
    <Class>
    <Date>
    

In [49]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np

### Problem 1

The g\_t\_results.csv file is a set of parent-reported scores on their child's Gifted and Talented tests. 
The two tests, OLSAT and NNAT, are used by NYC to determine if children are qualified for gifted programs.
The OLSAT Verbal has 16 questions for Kindergardeners and 30 questions for first, second, and third graders.
The NNAT has 48 questions. 
Using this dataset, answer the following questions.


1) What column has the highest number of null values and what percent of its values are null? Print the answer as a tuple with (column name, percentage). Make sure the second value is a percent.

2) List the columns that should be numeric that aren't. Print the answer as a tuple.

3) How many third graders have scores outside the valid range for the OLSAT Verbal Score? Print the answer

4) How many data values are missing (NaN)? Print the number.

Each part is one point.

In [54]:
# Part One
results = pd.read_csv('g_t_results.csv')
ratio = results.isna().sum() / results.shape[0]

bad_col = ratio.idxmax()
perc = ratio.max()
output = (bad_col, perc * 100)

print("Part One")
print(output)


# Part Two
output = ("OLSAT Verbal Score", "OLSAT Verbal Percentile", "NNAT Non Verbal Raw Score")
print("\nPart Two")
print(output)


# Part Three
scores = np.array(results[results['Entering Grade Level'] == '3']['OLSAT Verbal Score'])
scores = scores.astype(int)
bad_grades = sum(scores > 30) + sum(scores < 0)
print("\nPart Three")
print(bad_grades)


# Part Four
bad_val = results.isna().sum().sum()
print("\nPart Four")
print(bad_val)


Part One
('School Assigned', 75.21367521367522)

Part Two
('OLSAT Verbal Score', 'OLSAT Verbal Percentile', 'NNAT Non Verbal Raw Score')

Part Three
1

Part Four
192


### Problem 2

imdb.csv contains a small set of information about 99 movies. Clean the data set by doing the following in order: 

1) Remove duplicate rows by dropping the first **or** last. Print the shape of the dataframe after removing the rows.

2) Drop all rows that contain missing data. Print the shape of the dataframe after removing the rows.

3) Remove rows that have data outside valid data ranges and explain briefly how you determined your ranges for each column.

4) Identify and drop columns with three or fewer different values. Print a tuple with the names of the columns dropped.

5) Convert the titles to all lower case.

Print the first five rows of your dataframe.

In [62]:
imdb = pd.read_csv('imdb.csv')

# 1 remove duplicates
imdb = imdb.drop_duplicates(keep='first')
print(imdb.shape)

# 2 remove missing data
imdb = imdb.dropna()
print(imdb.shape)

# 3 remove invalid rows
# movies should be longer than 30 minutes and shorter than 300 minutes
imdb.drop(imdb[imdb['duration'] < 30].index, inplace=True)
imdb.drop(imdb[imdb['duration'] > 300].index, inplace=True)
# movies should cost more than 10 grand and make more than 10 grand
imdb.drop(imdb[imdb['gross'] < 10000].index, inplace=True)
imdb.drop(imdb[imdb['budget'] < 10000].index, inplace=True)
# movies did not exist before 1887
imdb.drop(imdb[imdb['title_year'] < 1887].index, inplace=True)
# movies should have more than a few facebook likes
imdb.drop(imdb[imdb['movie_facebook_likes'] < 1000].index, inplace=True)
print("See comments for part three")

# 4 remove boring columns
boring = list(imdb.nunique()[imdb.nunique() <= 3].index)
print(boring)
imdb = imdb.drop(boring, axis=1)

# 5 lower case
imdb['movie_title'] = imdb['movie_title'].str.lower()

# print the first five rows
print(imdb.head())

(93, 13)
(64, 13)
See comments for part three
['color', 'language']
       director_name  duration        gross  \
0    Martin Scorsese       240  116866727.0   
1        Shane Black       195  408992272.0   
2  Quentin Tarantino       187   54116191.0   
4      Peter Jackson       186  258355354.0   
8        Joss Whedon       173  623279547.0   

                                 genres                          movie_title  \
0          Biography|Comedy|Crime|Drama              the wolf of wall street   
1               Action|Adventure|Sci-Fi                           iron man 3   
2  Crime|Drama|Mystery|Thriller|Western                    the hateful eight   
4                     Adventure|Fantasy  the hobbit: the desolation of smaug   
8               Action|Adventure|Sci-Fi                         the avengers   

   title_year country       budget  imdb_score  \
0        2013     USA  100000000.0         8.2   
1        2013     USA  200000000.0         7.2   
2        2015     

### Problem 3

basketball.csv contains data for all NBA players between 2001 and 2018.
Each row represents a player's stats for a year.

Create two new features:

    career_length (int): number of years player has been playing (start at 0).
    
    target (str): The target team if the player is leaving. If the player is retiring, the target should be 'retires'.
                  A player is retiring if their name doesn't exist the next year.
                  (Set the players in 2019 to NaN).

Remove all duplicate players in each year.
Remove all rows except those where a player changes team, that is, target is not null nor 'retires'.

Drop the player, year, and team_id columns.

Return the first 10 lines of your dataframe.

In [72]:
ball = pd.read_csv("basketball.csv") 

# drop duplicates and sort by year
ball.drop_duplicates(subset=["player","year"],keep="first",inplace=True)
ball.sort_values(by="year",inplace=True)

# make new features
ball['career_length'] = 0.0
ball['target'] = None


# iterate over each player to fill in career_length and target
baller_names = ball["player"].unique()
for name in baller_names:
    # create a mask for each player
    player = ball["player"] == name
    # calculate career length
    num_years = player.sum()

    # fill in career_length
    ball.loc[player, "career_length"] = np.arange(num_years, dtype=int)

    # fill in target
    retirement = num_years - 1
    for i, index in enumerate(ball[player].index):
        # if the player retires, set target to "retires"
        if i == retirement: 
            ball.loc[index,"target"] = "retires"

        # if the player switches teams, set target to the next team
        elif ball[player].iloc[i]["team_id"] != ball[player].iloc[i+1]["team_id"]:
            ball.loc[index,"target"] = ball[player].iloc[i+1]["team_id"]

# remove null values and retired players
ball = ball[~ball["target"].isna()]
ball = ball[ball["target"] != 'retires']

# drop boring columns and resort
ball.drop(columns=["player","year","team_id"], inplace=True)
ball.sort_index(inplace=True)

ball.head(10)

Unnamed: 0,age,per,ws,bpm,career_length,target
453,27,8.2,1.0,-2.5,5.0,PHO
461,24,13.0,1.2,-0.9,2.0,ATL
462,24,15.9,6.2,2.9,3.0,MEM
464,33,12.7,3.7,-1.9,14.0,HOU
467,32,11.8,5.3,0.7,13.0,PHO
477,29,7.5,1.1,-2.8,9.0,MIN
482,31,14.1,1.9,-0.2,10.0,SAS
489,25,14.1,2.9,-2.4,6.0,CHO
490,29,12.6,2.8,0.1,2.0,SAC
493,28,13.0,0.0,-3.2,7.0,MIL


### Problem 4

Load housing.csv into a dataframe with index=0. Descriptions of the features are in housing_data_description.txt.  
The goal is to construct a regression model that predicts SalePrice using the other features of the dataset.  Do this as follows:

	1) Identify and handle the missing data.  Hint: Dropping every row with some missing data is not a good choice because it gives you an empty dataframe.  What can you do instead?
    FIXME
	2) Identify the variable with nonnumeric values that are misencoded as numbers.  One-hot encode it. Hint: don't forget to remove one of the encoded columns to prevent collinearity with the constant column (which you will add later).
    
    3) Add a constant column to the dataframe.

    4) Save a copy of the dataframe.

	5) Choose four categorical features that seem very important in predicting SalePrice. One-hot encode these features and remove all other categorical features.
		
	6) Run an OLS using all numerical data regression on your model.  

	
Print the ten features that have the highest coef in your model and the summary. Don't print the OLS

In [87]:
pd.set_option('display.max_rows', None)
housing = pd.read_csv("housing.csv", index_col=0)

# part one
housing.dropna(inplace=True, thresh=len(housing)*0.06, axis=1)

# part two
housing = pd.get_dummies(housing, columns=['MSSubClass'], drop_first=True)

# part three
housing["constant"] = 1

# part four
selection = housing.copy()


# part five
features = ['Neighborhood', 'MSZoning', 'OverallQual', 'OverallCond']

# perform one hot encoding
for feature in features:
    selection[feature] = selection[feature].fillna("None", inplace=True)
selection = pd.get_dummies(selection, columns=features, drop_first=True)

# remove all other categorical features
remove_cols = [col for col in selection.columns if selection[col].dtype != 'object']
selection = selection[remove_cols].fillna(0)


# part six
y = selection['SalePrice']
X = selection.loc[:,selection.columns != 'SalePrice']
results = sm.OLS(y, X).fit()

# get summary
summary = results.summary()
# Convert the summary table to a dataframe
results_as_html = summary.tables[1].as_html()
result_df = pd.read_html(results_as_html, header=0, index_col=0)[0]
# output the top 10 features
result_df.sort_values(['coef'], ascending=False)[:10]


Unnamed: 0,coef,std err,t,P>|t|,[0.025,0.975]
MSSubClass_40,20700.0,19100.0,1.086,0.278,-16700.0,58100.0
MSSubClass_45,19850.0,11400.0,1.745,0.081,-2470.18,42200.0
GarageCars,17480.0,3209.703,5.445,0.0,11200.0,23800.0
MSSubClass_30,7934.4901,6009.133,1.32,0.187,-3853.284,19700.0
Fireplaces,7906.0359,1898.388,4.165,0.0,4182.076,11600.0
BsmtFullBath,7082.187,2827.513,2.505,0.012,1535.615,12600.0
FullBath,6711.2647,3037.933,2.209,0.027,751.924,12700.0
TotRmsAbvGrd,6271.2262,1334.474,4.699,0.0,3653.465,8888.988
MSSubClass_70,4509.8385,7802.471,0.578,0.563,-10800.0,19800.0
BsmtHalfBath,2742.6837,4405.857,0.623,0.534,-5900.036,11400.0


### Problem 5

Using the copy of the dataframe you created in Problem 4, one-hot encode all the categorical variables.
Print the shape of the dataframe and run OLS.

Print the ten features that have the highest coef in your model and the summary.
Write a couple of sentences discussing which model is better and why.

In [89]:
# redo part four
housing = pd.read_csv("housing.csv", index_col=0)
housing.dropna(inplace=True, thresh=len(housing)*0.06, axis=1)
housing = pd.get_dummies(housing, columns=['MSSubClass'], drop_first=True)
housing["constant"] = 1

# one hot encode all categorical features

# get categorical features
cat_col = [col for col in housing.columns if housing[col].dtype == 'object']
# remove null values
for col in cat_col:
    housing[col] = housing[col].fillna("None", inplace=True)

# perform one hot encoding on all categorical features
hot_housing = pd.get_dummies(housing, columns=cat_col, drop_first=True)
# remove null values    
hot_housing.fillna(0, inplace=True)


# print shape
print(hot_housing.shape)

# perform OLS
y = hot_housing['SalePrice']
X = hot_housing.loc[:,hot_housing.columns != 'SalePrice']
results = sm.OLS(y, X).fit()

# get summary
summary = results.summary()
# Convert the summary table to a dataframe
results_as_html = summary.tables[1].as_html()
result_df = pd.read_html(results_as_html, header=0, index_col=0)[0]
# output the top 10 features
result_df.sort_values(['coef'], ascending=False)[:10]
result_df


(1460, 51)


Unnamed: 0,coef,std err,t,P>|t|,[0.025,0.975]
LotFrontage,0.1561,28.643,0.005,0.996,-56.032,56.344
LotArea,0.3617,0.101,3.588,0.0,0.164,0.56
OverallQual,17080.0,1214.739,14.057,0.0,14700.0,19500.0
OverallCond,5056.5526,1040.348,4.86,0.0,3015.758,7097.347
YearBuilt,493.9793,82.711,5.972,0.0,331.73,656.228
YearRemodAdd,92.4128,67.089,1.377,0.169,-39.191,224.017
MasVnrArea,30.6202,5.993,5.11,0.0,18.865,42.376
BsmtFinSF1,8.8166,2.525,3.492,0.0,3.864,13.769
BsmtFinSF2,2.0161,4.497,0.448,0.654,-6.805,10.838
BsmtUnfSF,-2.1628,2.445,-0.885,0.377,-6.959,2.633


The model output in problem 4 is better because it has six statistically significant regressors, with a p-value less than 0.05. Additionally, it is not overfit to a specific dataset as it only includes relevant features. Therefore irrelevant features do not dilute important ones, thus enhancing predictive power. Although the model suffers from omitted variables bias, the significance of the selected regressors mitigates this bias for the most part.