# **Income Prediction**

**Objective:** Explore and implement **Principal Component Analysis**, a **dimensionality reduction** technique, together with Logistic Regression to determine whether a person makes over 50K a year.

For this project we will be using the following UCI dataset: https://archive.ics.uci.edu/ml/datasets/Adult

Here are the features represented through columns :
<br>

**Input variables**
<br>
1 - age 
<br>
2 - workclass
<br>
3 - fnlwgt
<br>
4 - education 
<br>
5 - education-num
<br>
6 - marital-status
<br>
7 - occupation
<br>
8 - relationship
<br>
9 - race
<br>
10 - sex
<br>
11 - capital-gain
<br>
12 - capital-loss
<br>
13 - hours-per-week
<br>
14 - native-country
<br>



**Output/Target Variable**
<br>
15 - income
- (>)50K
- (<=)50K


## Table of Contents

-	Import Python libraries
-	Import dataset
-	Exploratory data analysis
-	Split data into training and test set
-	Feature engineering
-	Feature scaling
-	Logistic regression model with all features
-	Logistic Regression with PCA
-	Select right number of dimensions
-	Plot explained variance ratio with number of dimensions
-	Conclusion
	


## Import Python libraries

In [1]:
#import numpy
import numpy as np 

#import pandas
import pandas as pd

#import libraries for plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#ignore warnings
import warnings
warnings.filterwarnings('ignore')

## Import dataset

Use pandas to read adult.csv as a dataframe called adult

In [None]:
adult = pd.read_csv('adult.csv')

## Exploratory Data Analysis

### Check shape of dataset
<br>
Use the `.shape` attribute (it is an attribute, not a method, so it takes no parentheses)

In [None]:
adult.shape

How many instances and attributes are present in the dataset?

### Preview Dataset
<br>
Use head() method

In [None]:
adult.head()

### View summary of dataframe
<br>
Use info() method

In [None]:
adult.info()

Summary of the dataset shows that there are no missing values. But the preview shows that the dataset contains values coded as `?`. So, we will encode `?` as NaN values.

### Encode `?` as `NaNs`

In [None]:
adult[adult == '?'] = np.nan
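
To see why `info()` reports no missing values before this step, here is a minimal sketch on a toy frame. The column values are made up for illustration and are not taken from `adult.csv`. To pandas, a `?` is just a string, so it only counts as missing once it has been replaced with `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real dataset is loaded from adult.csv above.
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov"],
    "age": [39, 50, 38],
})

# '?' is an ordinary string, so pandas does not count it as missing.
missing_before = int(toy.isnull().sum().sum())

# replace() returns a new frame in which every '?' cell has become NaN.
toy_clean = toy.replace("?", np.nan)
missing_after = int(toy_clean.isnull().sum().sum())

print(missing_before, missing_after)
```

`replace` is used here as an alternative spelling of the boolean-mask assignment above; both mark the same cells as missing.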

### Again check the summary of dataframe

In [None]:
adult.info()

Which variables contain missing values?
<br>
What is the datatype of these variables?
<br>
We will impute the missing values with the most frequent value - the mode.

### Impute missing values with mode

In [9]:
for col in ['workclass', 'occupation', 'native.country']:
    #Assign the result back instead of calling fillna in place on a column selection
    adult[col] = adult[col].fillna(adult[col].mode()[0])
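
As a small illustration of what `mode()[0]` does, the sketch below imputes a toy column (the values are hypothetical). `mode()` ignores `NaN` and returns a Series, because there can be ties, so `[0]` picks its first entry.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing entries.
toy = pd.DataFrame(
    {"occupation": ["Sales", np.nan, "Sales", "Tech-support", np.nan]}
)

# mode() skips NaN and returns a Series; [0] takes the most frequent value.
most_frequent = toy["occupation"].mode()[0]

# Fill the gaps and assign the filled column back to the frame.
toy["occupation"] = toy["occupation"].fillna(most_frequent)

print(most_frequent)
print(toy["occupation"].tolist())
```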

### Check again for missing values

In [None]:
adult.isnull().sum()

Verify that there are no missing values in the dataset.

### Setting feature vector and target variable

In [11]:
#Set the 'income' column to y (the dataframe loaded above is named 'adult')
y = adult['income']

#Drop the 'income' column from the dataframe and set the remaining dataframe to X
X = adult.drop(['income'], axis=1)

## Split data into separate training and test set

In [13]:
#Import train_test_split
from sklearn.model_selection import train_test_split

#Split the data set into training data and testing data in a 7:3 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

## Feature Engineering

### Encode categorical variables

In [14]:
#Import the preprocessing module, which provides LabelEncoder
from sklearn import preprocessing

#Create a list named 'categorical' of all the categorical features in X
categorical = ['workclass', 'education', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country']

#Loop over the features in 'categorical'.
#For each feature, create a LabelEncoder instance named 'le',
#fit it on X_train with .fit_transform, then apply the same mapping to X_test with .transform
for feature in categorical:
    le = preprocessing.LabelEncoder()
    X_train[feature] = le.fit_transform(X_train[feature])
    X_test[feature] = le.transform(X_test[feature])
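
One caveat with fitting `LabelEncoder` on the training split only: `transform` raises a `ValueError` if the test split contains a category that never appeared in training. The sketch below reproduces this on made-up values and shows one possible alternative, `OrdinalEncoder` with `handle_unknown='use_encoded_value'` (available in scikit-learn 0.24 and later). Whether this actually happens with `adult.csv` depends on the split, so treat this as an illustration rather than a diagnosis.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical category values, chosen only for illustration.
train = np.array(["Private", "State-gov", "Private"])
test = np.array(["Private", "Never-worked"])  # 'Never-worked' is unseen in train

le = LabelEncoder()
le.fit(train)

try:
    le.transform(test)
    unseen_error = False
except ValueError:
    # LabelEncoder refuses labels it did not see during fit.
    unseen_error = True

# OrdinalEncoder expects 2-D input and can map unseen categories to a sentinel.
oe = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
oe.fit(train.reshape(-1, 1))
encoded_test = oe.transform(test.reshape(-1, 1)).ravel()

print(unseen_error, encoded_test)
```

The sentinel `-1` is an arbitrary choice for this sketch; any value outside the range of fitted codes would serve.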

## Feature Scaling

In [15]:
#Import StandardScaler
from sklearn.preprocessing import StandardScaler

#Create an instance named 'scaler'
scaler = StandardScaler()

#Fit the scaler on X_train with .fit_transform, then apply the same scaling to X_test with .transform
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns = X.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns = X.columns)
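
The reason for calling `fit_transform` on the training set but only `transform` on the test set is that the mean and standard deviation should be learned from training data alone. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single numeric feature; the values are made up.
train = np.array([[20.0], [40.0], [60.0]])
test = np.array([[40.0], [80.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training mean and std

print(scaler.mean_, scaler.scale_)
print(test_scaled.ravel())
```

The scaled training column has mean 0 and unit variance, while the test column generally does not, because it is scaled with the training statistics.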

## Logistic Regression model with all features

In [4]:
#Import LogisticRegression
#Import accuracy_score from sklearn.metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

#Create an instance of LogisticRegression() called logreg and fit it to the training data.
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

#Create predictions from the test set and name the result y_pred
y_pred = logreg.predict(X_test)

#print out the accuracy score for LogisticRegression
print('Logistic Regression accuracy score with all the features: {0:0.4f}'. format(accuracy_score(y_test, y_pred)))

## Logistic Regression with PCA

Scikit-Learn's `PCA` class implements the PCA algorithm, and the code below uses it. Before diving in, we explain another important concept: the explained variance ratio.


### Explained Variance Ratio

A very useful piece of information is the **explained variance ratio** of each principal component. It is available via the `explained_variance_ratio_` attribute of a fitted `PCA` object, and it indicates the proportion of the dataset's variance that lies along the axis of each principal component.
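
As a worked illustration, the explained variance ratio can be computed by hand: it is each eigenvalue of the data's covariance matrix divided by the sum of all eigenvalues. The sketch below does this with NumPy on synthetic data (the data and the seed are arbitrary assumptions) and compares the result with scikit-learn's `PCA`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: three features with clearly different spreads.
X = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 0.5])

# Eigenvalues of the covariance matrix are the variances along each principal axis.
cov = np.cov(X, rowvar=False)
eigenvalues = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh returns ascending order
manual_ratio = eigenvalues / eigenvalues.sum()

# The same quantity as reported by scikit-learn.
sk_ratio = PCA().fit(X).explained_variance_ratio_

print(manual_ratio)
print(sk_ratio)
```

The ratios sum to 1 and are sorted from the most to the least informative component.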

Now, let's get to the PCA implementation.


In [6]:
#Import PCA from sklearn.decomposition
from sklearn.decomposition import PCA

#Create an instance named 'pca'
pca = PCA()

#Use .fit_transform method to fit pca to X_train
X_train = pca.fit_transform(X_train)

#Use pca.explained_variance_ratio_ to find the proportion of the dataset's variance explained by each principal component
pca.explained_variance_ratio_

**Observations**

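- Approximately what % of the variance is explained by the first 13 principal components?

- How much variance is explained by the last component? Can we assume that it carries little information?

- Note that each principal component is a linear combination of all the original features, so the last component is not the same thing as the last column. As a rough experiment, let's drop the last original feature (`native.country`), train the model again and calculate the accuracy.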



### Logistic Regression with first 13 features

In [7]:
#Set the 'income' column to y (the dataframe loaded above is named 'adult')
y = adult['income']

#Drop the 'income' and 'native.country' columns from the dataframe and set the remaining dataframe to X
X = adult.drop(['income','native.country'], axis=1)

#Split the data set into training data and testing data in a 7:3 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)


#Create a list named 'categorical' of all the categorical features in our newly created X
categorical = ['workclass', 'education', 'marital.status', 'occupation', 'relationship', 'race', 'sex']


#Loop over the features in 'categorical'.
#For each feature, create a LabelEncoder instance named 'le',
#fit it on X_train with .fit_transform, then apply the same mapping to X_test with .transform
for feature in categorical:
    le = preprocessing.LabelEncoder()
    X_train[feature] = le.fit_transform(X_train[feature])
    X_test[feature] = le.transform(X_test[feature])

#Fit the scaler on X_train with .fit_transform, then apply the same scaling to X_test with .transform
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns = X.columns)

X_test = pd.DataFrame(scaler.transform(X_test), columns = X.columns)

#Create an instance of LogisticRegression() called logreg and fit it to the training data.
logreg = LogisticRegression()
logreg.fit(X_train, y_train)


#Create predictions from the test set and name the result y_pred
y_pred = logreg.predict(X_test)

#print out the accuracy score for LogisticRegression
print('Logistic Regression accuracy score with the first 13 features: {0:0.4f}'. format(accuracy_score(y_test, y_pred)))

### Comment

- What is the change in accuracy of our model?

- Now, consider the last two principal components. Approximately what % of the variance do they explain, combined?

- Let's drop the last two original features (`native.country` and `hours.per.week`), train the model again and calculate the accuracy.


### Logistic Regression with first 12 features

In [8]:
#Set the 'income' column to y (the dataframe loaded above is named 'adult')
y = adult['income']

#Drop the 'income', 'native.country' and 'hours.per.week' columns from the dataframe and set the remaining dataframe to X
X = adult.drop(['income','native.country', 'hours.per.week'], axis=1)

#Split the data set into training data and testing data in a 7:3 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)


#Create a list named 'categorical' of all the categorical features in our newly created X
categorical = ['workclass', 'education', 'marital.status', 'occupation', 'relationship', 'race', 'sex']


#Loop over the features in 'categorical'.
#For each feature, create a LabelEncoder instance named 'le',
#fit it on X_train with .fit_transform, then apply the same mapping to X_test with .transform
for feature in categorical:
    le = preprocessing.LabelEncoder()
    X_train[feature] = le.fit_transform(X_train[feature])
    X_test[feature] = le.transform(X_test[feature])


#Fit the scaler on X_train with .fit_transform, then apply the same scaling to X_test with .transform
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns = X.columns)

X_test = pd.DataFrame(scaler.transform(X_test), columns = X.columns)

#Create an instance of LogisticRegression() called logreg and fit it to the training data.
logreg = LogisticRegression()
logreg.fit(X_train, y_train)


#Create predictions from the test set and name the result y_pred
y_pred = logreg.predict(X_test)


#print out the accuracy score for LogisticRegression
print('Logistic Regression accuracy score with the first 12 features: {0:0.4f}'. format(accuracy_score(y_test, y_pred)))

### Comment

- How does the accuracy change when the model is trained with 12 features?

- Lastly, consider the last three principal components. Approximately what % of the variance do they explain, combined?

- Let's repeat the process: drop the last three original features, train the model again and calculate the accuracy.


### Logistic Regression with first 11 features

In [9]:
#Set the 'income' column to y (the dataframe loaded above is named 'adult')
y = adult['income']

#Drop the 'income', 'native.country', 'hours.per.week' and 'capital.loss' columns from the dataframe and set the remaining dataframe to X
X = adult.drop(['income','native.country', 'hours.per.week', 'capital.loss'], axis=1)

#Split the data set into training data and testing data in a 7:3 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)


#Create a list named 'categorical' of all the categorical features in our newly created X
categorical = ['workclass', 'education', 'marital.status', 'occupation', 'relationship', 'race', 'sex']


#Loop over the features in 'categorical'.
#For each feature, create a LabelEncoder instance named 'le',
#fit it on X_train with .fit_transform, then apply the same mapping to X_test with .transform
for feature in categorical:
    le = preprocessing.LabelEncoder()
    X_train[feature] = le.fit_transform(X_train[feature])
    X_test[feature] = le.transform(X_test[feature])

#Fit the scaler on X_train with .fit_transform, then apply the same scaling to X_test with .transform
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns = X.columns)

X_test = pd.DataFrame(scaler.transform(X_test), columns = X.columns)

#Create an instance of LogisticRegression() called logreg and fit it to the training data.
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

#Create predictions from the test set and name the result y_pred
y_pred = logreg.predict(X_test)


#print out the accuracy score for LogisticRegression
print('Logistic Regression accuracy score with the first 11 features: {0:0.4f}'. format(accuracy_score(y_test, y_pred)))


### Comment

- Has the accuracy increased or decreased if we drop the last three features?

- Our aim is to maximize the accuracy. When did we get the highest accuracy?

## Select right number of dimensions

- The above process works well if the number of dimensions is small.

- But it is quite cumbersome if we have a large number of dimensions.

- In that case, a better approach is to compute the number of dimensions that explains a sufficiently large portion of the variance.

- The following code computes PCA without reducing dimensionality, then computes the minimum number of dimensions required to preserve 90% of the training set variance.

In [5]:
#Set the 'income' column to y (the dataframe loaded above is named 'adult')
y = adult['income']

#Drop the 'income' column from the dataframe and set the remaining dataframe to X
X = adult.drop(['income'], axis=1)

#Split the data set into training data and testing data in a 7:3 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)


#Create a list named 'categorical' of all the categorical features in X
categorical = ['workclass', 'education', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country']


#Loop over the features in 'categorical'.
#For each feature, create a LabelEncoder instance named 'le',
#fit it on X_train with .fit_transform, then apply the same mapping to X_test with .transform
for feature in categorical:
    le = preprocessing.LabelEncoder()
    X_train[feature] = le.fit_transform(X_train[feature])
    X_test[feature] = le.transform(X_test[feature])

#Use .fit_transform to scale all features in X_train
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns = X.columns)

#Create an instance of PCA named 'pca'
pca= PCA()

#Use the .fit method to fit pca to X_train (no transformed output is needed here)
pca.fit(X_train)

cumsum = np.cumsum(pca.explained_variance_ratio_)
dim = np.argmax(cumsum >= 0.90) + 1
print('The number of dimensions required to preserve 90% of variance is',dim)
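
To make the `np.argmax` step concrete, here is a small worked example with hypothetical explained variance ratios for four components. On a boolean array, `argmax` returns the index of the first `True`, and adding 1 converts that zero-based index into a number of components.

```python
import numpy as np

# Hypothetical explained variance ratios for four components.
ratios = np.array([0.50, 0.30, 0.15, 0.05])

cumsum = np.cumsum(ratios)  # running total of explained variance
reached = cumsum >= 0.90    # boolean mask: has the threshold been reached?

# argmax gives the index of the first True; +1 turns the index into a count.
dim = int(np.argmax(reached)) + 1

print(reached.tolist(), dim)
```

Here the first two components explain 80% of the variance and the first three explain 95%, so three components are needed to reach 90%.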

### Comment

- With the required number of dimensions found, we can set `n_components` to `dim` and run PCA again.

- With the reduced training and test sets, we can then train the model and calculate the resulting accuracy.
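
These two steps might look like the sketch below. It uses synthetic data from `make_classification` as a stand-in for the preprocessed adult features (an assumption made so that the example runs on its own). It also passes a float to `n_components`, which asks scikit-learn's `PCA` to keep the smallest number of components whose cumulative explained variance exceeds that fraction.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded adult features: 14 numeric columns.
X, y = make_classification(n_samples=1000, n_features=14, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Scale, keep enough components for 90% of the variance, then classify.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.90),
    LogisticRegression(),
)
model.fit(X_train, y_train)

# Number of components PCA actually kept, and the test accuracy.
n_kept = model.named_steps["pca"].n_components_
accuracy = accuracy_score(y_test, model.predict(X_test))
print(n_kept, round(accuracy, 4))
```

Wrapping the steps in a pipeline means the test set is scaled and projected with the parameters learned from the training set, so the accuracy is measured on data transformed in exactly the same way.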

## Plot explained variance ratio with number of dimensions

- An alternative option is to plot the explained variance as a function of the number of dimensions.

- In the plot, we should look for an elbow where the explained variance stops growing fast.

- This can be thought of as the intrinsic dimensionality of the dataset.

- Now, we will plot the cumulative explained variance ratio against the number of components.

In [None]:
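plt.figure(figsize=(8,6))
#Plot against 1..n so the x-axis shows the number of components, not a zero-based index
cumulative = np.cumsum(pca.explained_variance_ratio_)
plt.plot(range(1, len(cumulative) + 1), cumulative)
#xlim takes only the left and right limits; it has no step argument
plt.xlim(1, 14)
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.show()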

### Comment

According to the above plot, how many components explain around 90% of variance?

## Conclusion

-	In this project, we discussed Principal Component Analysis, a widely used dimensionality reduction technique.
-	We demonstrated PCA together with Logistic Regression on the adult dataset.
-	We first compared accuracy while manually dropping features one at a time.
-	We then computed the number of dimensions required to preserve 90% of the variance.
-	Finally, we plotted the cumulative explained variance ratio against the number of dimensions to check that result visually.
