# Choose Your ML Problem and Data

Implement a model to solve a machine learning problem of your choosing. First, you will have to make some decisions, such as which model to choose and which data preparation techniques may be necessary, and formulate a project plan accordingly. 

In this project, you will select a data set and choose a predictive problem that the data set supports. You will then inspect the data with your problem in mind and begin to formulate your project plan. You will create this project plan in the written assignment that follows.


### Import Packages

Before you get started, import a few packages. You can import additional packages that you have used in this course that you may need for this task.

In [68]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.metrics import mean_squared_error, r2_score

## Step 1: Choose Your Data Set and Load the Data

You will have the option to choose one of four data sets that you have worked with in this program:

* The "adult" data set that contains Census information from 1994: `adultData.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`
* Book Review data set: `bookReviewsData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some do not include some of the preprocessing necessary for specific models. 

#### Load the Data Set

The code cell below contains filenames (path + filename) for each of the four data sets available to you.

<b>Task:</b> In the code cell below, use the same method you have been using to load the data using `pd.read_csv()` and save it to DataFrame `df`. 

You can load each file as a new DataFrame to inspect the data before choosing your data set.

In [2]:
# File names of the four data sets
adultDataSet_filename = os.path.join(os.getcwd(), "data", "adultData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")


df = pd.read_csv(WHRDataSet_filename, header=0 )

df.head()

Unnamed: 0,country,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect,Confidence in national government,Democratic Quality,Delivery Quality,Standard deviation of ladder by country-year,Standard deviation/Mean of ladder by country-year,GINI index (World Bank estimate),"GINI index (World Bank estimate), average 2000-15","gini of household income reported in Gallup, by wp5-year"
0,Afghanistan,2008,3.72359,7.16869,0.450662,49.209663,0.718114,0.181819,0.881686,0.517637,0.258195,0.612072,-1.92969,-1.655084,1.774662,0.4766,,,
1,Afghanistan,2009,4.401778,7.33379,0.552308,49.624432,0.678896,0.203614,0.850035,0.583926,0.237092,0.611545,-2.044093,-1.635025,1.722688,0.391362,,,0.441906
2,Afghanistan,2010,4.758381,7.386629,0.539075,50.008961,0.600127,0.13763,0.706766,0.618265,0.275324,0.299357,-1.99181,-1.617176,1.878622,0.394803,,,0.327318
3,Afghanistan,2011,3.831719,7.415019,0.521104,50.367298,0.495901,0.175329,0.731109,0.611387,0.267175,0.307386,-1.919018,-1.616221,1.78536,0.465942,,,0.336764
4,Afghanistan,2012,3.782938,7.517126,0.520637,50.709263,0.530935,0.247159,0.77562,0.710385,0.267919,0.43544,-1.842996,-1.404078,1.798283,0.475367,,,0.34454


I have chosen the World Happiness Report (WHR) data set. The label used in the rest of this notebook is "Perceptions of corruption", so this is a regression problem.

## Step 2: Choose Your Predictive Problem and Label 

Now that you have chosen your data set, you can: 

1. Choose what you would like to predict (i.e. the label) 
2. Identify your problem type: is it a classification or regression problem?

<b>Task:</b> In the markdown cell below, state what you are predicting (the label) and whether this is a classification or regression problem.

The label for the data I have chosen is "Perceptions of corruption". This is a regression problem because we are predicting a continuous value, the perceived level of corruption, from the other features in the data.

## Step 3: Inspect Your Data

In the code cell below, use some of the techniques you have learned in this course to take a look at your data. As you are investigating your data, consider the following to help you formulate your project plan:

1. What are my features?
2. Which model (or models) should I select that is appropriate for my machine learning problem and data?
3. Which data preparation techniques may be needed for my model (e.g. perform one-hot encoding)?
4. Which techniques should I use to evaluate my model's performance and improve my model?

<b>Task</b>: Use the techniques you have learned in this course to inspect your data.



The data set does not have many features, so it is possible to use all of them.
I am going to use a K-nearest neighbors (KNN) regressor and a linear regression model for this data set.
I am going to drop the columns that have too many null values and remove the null values from the remaining columns, so that my data is clean and preprocessed.
The evaluation metrics for my models will be RMSE, MSE, and R2.
My test size will be 0.20. I will most likely use grid search with K-fold cross-validation, because my data set is quite small and setting aside separate validation and test sets seems hard. I will also train two models and compare the results to see which one is better suited to my problem.
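
The plan above mentions K-fold cross-validation and comparing two models. A minimal sketch of that idea follows. It assumes synthetic data from `make_regression` as a stand-in for the WHR features and label, and the names `X_demo`, `y_demo`, `candidates`, and `cv_rmse` are hypothetical, chosen only for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the real features and label (assumption for illustration only)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Shuffled 5-fold split; the same folds are reused for both models
kf = KFold(n_splits=5, shuffle=True, random_state=1234)

candidates = {
    "KNN": KNeighborsRegressor(n_neighbors=5),
    "LR": LinearRegression(),
}

cv_rmse = {}
for name, estimator in candidates.items():
    # scikit-learn maximizes scores, so RMSE is exposed as a negative value
    neg_rmse = cross_val_score(
        estimator, X_demo, y_demo, cv=kf, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = -neg_rmse.mean()
    print(f"[{name}] mean CV RMSE: {cv_rmse[name]:.3f}")
```

Reusing the same `KFold` object for both models means each one is scored on identical splits, which makes the comparison fairer on a small data set.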

In [3]:
df.columns.tolist()

['country',
 'year',
 'Life Ladder',
 'Log GDP per capita',
 'Social support',
 'Healthy life expectancy at birth',
 'Freedom to make life choices',
 'Generosity',
 'Perceptions of corruption',
 'Positive affect',
 'Negative affect',
 'Confidence in national government',
 'Democratic Quality',
 'Delivery Quality',
 'Standard deviation of ladder by country-year',
 'Standard deviation/Mean of ladder by country-year',
 'GINI index (World Bank estimate)',
 'GINI index (World Bank estimate), average 2000-15',
 'gini of household income reported in Gallup, by wp5-year']

In [4]:
df.shape

(1562, 19)

In [5]:
# To get a sense of how much preprocessing is needed, count the null values in each feature.
nan_count = df.isnull().sum()
nan_count

country                                                       0
year                                                          0
Life Ladder                                                   0
Log GDP per capita                                           27
Social support                                               13
Healthy life expectancy at birth                              9
Freedom to make life choices                                 29
Generosity                                                   80
Perceptions of corruption                                    90
Positive affect                                              18
Negative affect                                              12
Confidence in national government                           161
Democratic Quality                                          171
Delivery Quality                                            171
Standard deviation of ladder by country-year                  0
Standard deviation/Mean of ladder by cou

In [6]:
'''Features such as 'GINI index (World Bank estimate)' and
'gini of household income reported in Gallup, by wp5-year' have too many null values
for a data set with this few examples, so the best option is to drop the three GINI features.'''
df.drop(columns=[
    'GINI index (World Bank estimate)',
    'GINI index (World Bank estimate), average 2000-15',
    'gini of household income reported in Gallup, by wp5-year',
], inplace=True)
df.shape

(1562, 16)
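
The choice of which columns to drop can also be made with an explicit cutoff rather than by eye. The sketch below assumes a tiny hypothetical DataFrame `demo` and a hypothetical `threshold` of 0.5. Neither comes from the notebook, and the right cutoff for the WHR data is a judgment call.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame standing in for the WHR data
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, np.nan, 3.0, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

# Fraction of missing values per column (the mean of a boolean mask)
missing_frac = demo.isnull().mean()

# Hypothetical cutoff: drop columns with more than half of their values missing
threshold = 0.5
cols_to_drop = missing_frac[missing_frac > threshold].index.tolist()
demo_reduced = demo.drop(columns=cols_to_drop)

print(missing_frac)
print(cols_to_drop)
```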

In [7]:
is_int_or_float = (df.dtypes == np.int64) | (df.dtypes == np.float64)
is_int_or_float

country                                              False
year                                                  True
Life Ladder                                           True
Log GDP per capita                                    True
Social support                                        True
Healthy life expectancy at birth                      True
Freedom to make life choices                          True
Generosity                                            True
Perceptions of corruption                             True
Positive affect                                       True
Negative affect                                       True
Confidence in national government                     True
Democratic Quality                                    True
Delivery Quality                                      True
Standard deviation of ladder by country-year          True
Standard deviation/Mean of ladder by country-year     True
dtype: bool

In [None]:
'''All the features that have null values are int or float types, so we can
replace the null values with the feature mean.'''

In [8]:
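has_null = df.columns[df.isnull().any()].tolist()
for column in has_null:
    column_mean = df[column].mean()
    # Assign the result back instead of calling fillna(inplace=True) on a column selection,
    # which newer pandas versions warn about and may not apply to df.
    # Note: the label column is also among the columns filled here.
    df[column] = df[column].fillna(column_mean)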
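
Filling with the mean of the whole column uses information from rows that later end up in the test set. One common alternative, sketched here as an assumption rather than as the approach this notebook takes, is to split first and learn the means from the training rows only with `SimpleImputer`. The frame `demo` and the series `target` are hypothetical toy data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy features and label
demo = pd.DataFrame({
    "x1": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0, np.nan, 8.0],
    "x2": [10.0, np.nan, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0],
})
target = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])

# Split before imputing so the test rows do not influence the fill values
X_tr, X_te, y_tr, y_te = train_test_split(demo, target, test_size=0.25, random_state=0)

imputer = SimpleImputer(strategy="mean")
# fit_transform learns the column means from the training rows only
X_tr_filled = pd.DataFrame(
    imputer.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index
)
# transform reuses those training means for the test rows
X_te_filled = pd.DataFrame(
    imputer.transform(X_te), columns=X_te.columns, index=X_te.index
)
```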

In [10]:
# Let's check that all the null values have been filled
df.isnull().any()


country                                              False
year                                                 False
Life Ladder                                          False
Log GDP per capita                                   False
Social support                                       False
Healthy life expectancy at birth                     False
Freedom to make life choices                         False
Generosity                                           False
Perceptions of corruption                            False
Positive affect                                      False
Negative affect                                      False
Confidence in national government                    False
Democratic Quality                                   False
Delivery Quality                                     False
Standard deviation of ladder by country-year         False
Standard deviation/Mean of ladder by country-year    False
dtype: bool

In [61]:
'''Now that we have preprocessed the data, we can move on to creating our models.'''
y = df['Perceptions of corruption']
X = df.drop(columns=['Perceptions of corruption', 'country'])

In [62]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.20, random_state=1234)

In [63]:
print(X_train.shape)
print(X_test.shape)

(1249, 14)
(313, 14)


In [64]:
num_examples = len(X_train)

param_grid = {'n_neighbors': (np.linspace(2, np.sqrt(num_examples), num=10, dtype=int)).tolist()}

param_grid

{'n_neighbors': [2, 5, 9, 13, 16, 20, 24, 27, 31, 35]}

In [65]:
model = KNeighborsRegressor()

grid = GridSearchCV(model, param_grid, cv=5)

grid_search = grid.fit(X_train, y_train)
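
KNN relies on distances, so features on large scales (such as `year` or life expectancy) can dominate features that lie between 0 and 1. A possible refinement, sketched under the assumption of synthetic data and hypothetical step names `scale` and `knn`, is to standardize inside a `Pipeline` so that the scaling is refit within each cross-validation fold.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption for illustration only)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
# Put one feature on a much larger scale than the others
X_demo[:, 0] = X_demo[:, 0] * 1000

# Scaling happens inside the pipeline, so each CV fold fits its own scaler
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsRegressor()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
demo_grid = {"knn__n_neighbors": [2, 5, 9, 13]}

search = GridSearchCV(pipe, demo_grid, cv=5)
search.fit(X_demo, y_demo)
print(search.best_params_)
```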

In [51]:
best_n = grid_search.best_estimator_.n_neighbors

In [52]:
# Refit KNN with the best n_neighbors and evaluate on the test set
model_best = KNeighborsRegressor(n_neighbors=best_n)
model_best.fit(X_train, y_train)
y_knn_pred = model_best.predict(X_test)  # regression predictions, not class labels
mse = mean_squared_error(y_test, y_knn_pred)
print("Mean Squared Error:", mse)
r2 = r2_score(y_test, y_knn_pred)
print("R2_score:", r2)

Mean Squared Error: 0.015130271399414771
R2_score: 0.4407402329444323


In [69]:
# 1. Create the LinearRegression model object
lr_model = LinearRegression()

# 2. Fit the model to the training data below
lr_model.fit(X_train, y_train)

# 3.  Call predict() to use the fitted model to make predictions on the test data. Save the results to variable
# 'y_lr_pred'
y_lr_pred = lr_model.predict(X_test)

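# 4: Compute the RMSE and R2 (on y_test and y_lr_pred) and save the results to lr_rmse and lr_r2
# (np.sqrt is used because the `squared` argument is deprecated and removed in newer scikit-learn releases)
lr_rmse = np.sqrt(mean_squared_error(y_test, y_lr_pred))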
lr_r2 = r2_score(y_test, y_lr_pred)


print('[LR] Root Mean Squared Error: {0}'.format(lr_rmse))
print('[LR] R2: {0}'.format(lr_r2))

[LR] Root Mean Squared Error: 0.12201210213114612
[LR] R2: 0.4497340011734152


In [None]:
'''The KNN test MSE is about 0.015 (an RMSE of about 0.123). MSE is in squared label units, not a
percentage, and it looks small mainly because the label values shown above are all below 1. The R2
score of about 0.44 is fairly low, so we may want to re-evaluate the model and look more closely at
what limits it.'''
'''Once both errors are on the same scale, the LR model has a slightly higher R2 (0.450 vs. 0.441) and
a slightly lower RMSE (about 0.122 vs. 0.123). The two models are very close, so we will have to see
which one is better once we have more data.'''
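
The KNN cell reports MSE while the LR cell reports RMSE, so the two printed errors are on different scales. The short sketch below puts them on the same scale by taking the square root of the KNN MSE. The two numbers are copied from the outputs above, and the variable names `knn_mse`, `knn_rmse`, and `lr_rmse_reported` are hypothetical.

```python
import math

knn_mse = 0.015130271399414771          # KNN test MSE printed above
lr_rmse_reported = 0.12201210213114612  # LR test RMSE printed above

# RMSE is the square root of MSE, which puts the error back in label units
knn_rmse = math.sqrt(knn_mse)

print(f"KNN RMSE: {knn_rmse:.4f}")          # KNN RMSE: 0.1230
print(f"LR  RMSE: {lr_rmse_reported:.4f}")  # LR  RMSE: 0.1220
```

On this test split the two models differ by roughly 0.001 in RMSE, which is too small a gap to treat either model as clearly better.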