# **How to start the lab:**

1.	Download the python code from Brightspace.
2.	Open the code in Google Colab.
3.	Download the dataset and review the dataset.
4.	Tasks are given within the Python program.
5.	Complete each task.
6.	Finally, submit the file in Brightspace (see the Lab3 lab file for submission instructions).


<a id='intro'></a>
# **Case Study: Income Classification/Prediction Using Supervised Learning**

# **You will start from the basics.**

**The aim of the project** is to employ several supervised algorithms to accurately model individuals' income, that is, whether a person makes more than $50,000 or not, using data collected from the 1994 U.S. Census.

The dataset used is the **Census Income dataset** from the UCI Machine Learning Repository, which contains 32,561 rows and 15 columns (including the income label). This dataset is large enough to explore supervised learning methods.

You can download dataset using the UCI website as given below:

https://archive.ics.uci.edu/ml/datasets/Census+Income

# **The tasks are as follows:**

1. Open UCI website then open Data Folder.
2. Download "adult.data". This is the dataset we will process. It is raw data that cannot be processed in its current form, so we need to complete the next task.
3. Perform data cleaning.
4. Perform other tasks one by one.
5. Prepare a report for your own records (the report can be used for assignment preparation).
6. I highly recommend preparing a report.

In [None]:
import zipfile

# import needed libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

%matplotlib inline

# **Task**: Read the dataset. You can download dataset using the UCI website as given below:

https://archive.ics.uci.edu/ml/datasets/Census+Income

In [None]:
import requests
import io
import zipfile
url = 'https://archive.ics.uci.edu/static/public/20/census+income.zip'
with open('census_income.zip', 'wb') as f:
    f.write(requests.get(url).content)

column_names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
       'marital_status', 'occupation', 'relationship', 'race', 'sex',
       'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income']

with zipfile.ZipFile('census_income.zip', 'r') as zip_ref:
    with zip_ref.open('adult.data') as file:
        cens = pd.read_csv(file, names=column_names, header=None, na_values='?')

The project poses some questions about this experiment:

**1. Does an individual make more than 50k income or not?**

**2. What are the most important features that help to define the income of an individual?**

<a id='overview'></a>
# Overview

In [None]:
# take an overview look at the data
cens.head()

# Task: Display first 20 instances of the dataset.

In [None]:
#Write code here
cens.head(20) # Display the first 20 instances of the dataset

In [None]:
cens.info()

In [None]:
# Total number of records
n_records = cens.shape[0]

# Total number of features
n_features = cens.shape[1]

# Number of records where individual's income is more than $50,000
# (the raw values still carry a leading space at this point)
n_greater_50k = cens[cens['income'] == ' >50K'].shape[0]

# Number of records where individual's income is at most $50,000
n_at_most_50k = cens[cens['income'] == ' <=50K'].shape[0]

# Percentage of individuals whose income is more than $50,000
greater_percent =  (n_greater_50k / n_records) * 100

# Print the results
print("Total number of records: {}".format(n_records))
print("Total number of features: {}".format(n_features))
print("Individuals making more than $50k: {}".format(n_greater_50k))
print("Individuals making at most $50k: {}".format(n_at_most_50k))
print("Percentage of individuals making more than $50k: {:.2f}%".format(greater_percent))

<a id='clean'></a>
# Data Cleaning

In [None]:
# drop unneeded columns
cens.drop('education', inplace=True, axis=1)
cens.columns.tolist()

- We have dropped the education feature, which duplicates education_num in a non-numerical format.

The matching education level of the education number:

**1**: Preschool, **2**: 1st-4th, **3**: 5th-6th, **4**: 7th-8th, **5**: 9th, **6**: 10th, **7**: 11th, **8**: 12th, **9**: HS-grad,

**10**: Some-college, **11**: Assoc-voc, **12**: Assoc-acdm, **13**: Bachelors, **14**: Masters, **15**: Prof-school, **16**: Doctorate

In [None]:
# check for nulls
cens.isna().sum()

- It appears that there are no null values in the dataset. Missing entries are encoded as "?" instead and are handled further below.

In [None]:
# check duplicates and remove it
print("Before removing duplicates:", cens.duplicated().sum())

cens = cens[~cens.duplicated()]

print("After removing duplicates:", cens.duplicated().sum())

- There are 24 duplicate rows in our dataset, so we remove them to keep the data realistic and error-free.

In [None]:
# before discarding
cens['sex'].value_counts()

In [None]:
# discard spaces from entries
columns = ['workclass', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_country', 'income']
# for column in columns:
#     cens[column] = cens[column].str.strip()
# Same as above, applied to all listed columns at once
cens[columns] = cens[columns].apply(lambda x: x.str.strip())
cens.head()

In [None]:
# after discarding
cens.sex.value_counts()

- Discarding the spaces from the entries of the dataset, for easier access.

In [None]:
# before changing "?"
cens.workclass.value_counts()

In [None]:
# changing "?" to Unknown
change_columns = ['workclass', 'occupation', 'native_country']
# for column in change_columns:
#         cens[column] = cens[column].replace({'?': 'Unknown'})

# Same as above, applied to the affected columns at once
cens[change_columns] = cens[change_columns].apply(lambda x: x.replace({'?': 'Unknown'}))

In [None]:
# after changing "?"
cens.workclass.value_counts()

In [None]:
# Looking through the dataset again
cens.head()

- Changing "?" symbol to "Unknown", for better interpretation and cleaner representation.

<a id='explore'></a>
# Data Exploration

In [None]:
# a quick look on some statistics about the data
cens.describe()

In [None]:
ct_counts = cens.groupby(['education_num', 'income']).agg(count=('income', 'size'))
ct_counts = ct_counts.reset_index()
ct_counts
# ct_counts = cens.groupby(['education_num', 'income']).size()
# ct_counts.head()
# ct_transform = cens.groupby(['education_num']).transform(lambda x: x.nunique())
# ct_transform

In [None]:
ct_counts = ct_counts.pivot(index='education_num', columns='income', values='count').fillna(0)
ct_counts

In [None]:
import seaborn as sns
plt.figure(figsize=(10, 10))
sns.heatmap(ct_counts, annot=True, fmt='.0f', cbar_kws={'label': 'Number of Individuals'})
plt.title('Number of People for Education Class relative to Income')
plt.xlabel('Income')
plt.ylabel('Education class');

In [None]:
# Heat map
plt.figure(figsize=(10,10))

ct_counts = cens.groupby(['education_num', 'income']).size()
ct_counts = ct_counts.reset_index(name = 'count')
ct_counts = ct_counts.pivot(index = 'education_num', columns = 'income', values = 'count').fillna(0)

sb.heatmap(ct_counts, annot = True, fmt = '.0f', cbar_kws = {'label' : 'Number of Individuals'})
plt.title('Number of People for Education Class relative to Income')
plt.xlabel('Income ($)')
plt.ylabel('Education Class');

- In the graph above, we can see that people with education classes 9 and 10 make up the largest portion of the dataset. We also notice that most people with education classes 14 to 16 make >50K, whereas most people in the lower education classes make <=50K.

In [None]:
# Clustered Bar Chart
plt.figure(figsize=(8,6))
ax = sb.barplot(data = cens, x = 'income', y = 'age', hue = 'sex')
ax.legend(loc = 8, ncol = 3, framealpha = 1, title = 'Sex')
plt.title('Average of Age for Sex relative to Income')
plt.xlabel('Income ($)')
plt.ylabel('Average of Age');

- The figure shows that, in general, people earning >50K have a higher average age than those earning <=50K. In both income groups, males have a slightly higher average age than females.

In [None]:
# Bar Chart
plt.figure(figsize=(8,6))
sb.barplot(data=cens, x='income', y='hours_per_week', palette='YlGnBu')
plt.title('Average of Hours per Week relative to Income')
plt.xlabel('Income ($)')
plt.ylabel('Average of Hours per Week');

- We notice here that people earning >50K work more hours per week on average than those earning <=50K, which is a reasonable result.

<a id='preprocess'></a>
# Data Preprocessing

In [None]:
cens_prep = cens.copy()

- We copy the dataset so that the cleaned version stays available for later use, while the copy is prepared for the model.

In [None]:
# Scaling
from sklearn.preprocessing import MinMaxScaler
numerical = ['age', 'capital_gain', 'capital_loss', 'hours_per_week', 'fnlwgt']

scaler = MinMaxScaler()
cens_prep[numerical] = scaler.fit_transform(cens_prep[numerical])

In [None]:
cens_prep.sample(3)

- The numerical features have been scaled with MinMaxScaler, which maps each feature to the range between 0 and 1. This helps prepare the data for the model.

$$
X_{\text{scaled}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}
$$
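To make the formula concrete, here is a minimal dependency-free sketch of min-max scaling. The function name `min_max_scale` and the sample ages are made up for illustration, and the handling of a constant column is an assumption of this sketch rather than a description of what `MinMaxScaler` does internally.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: map everything to 0.0 to avoid dividing by zero
        # (an assumption made for this sketch)
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical ages, not taken from the dataset
ages = [17, 30, 45, 90]
scaled_ages = min_max_scale(ages)
print(scaled_ages)
```

The smallest value maps to 0 and the largest to 1, so a single extreme value (for example a very large capital gain) compresses all the other values towards 0.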

In [None]:
# Encoding
cens_prep['sex'] = cens_prep.sex.replace({"Female": 0, "Male": 1})
cens_prep['income'] = cens_prep.income.replace({"<=50K": 0, ">50K": 1})

# Create dummy variables
cens_prep = pd.get_dummies(cens_prep)

In [None]:
encoded = list(cens_prep.columns)
print("{} total features after one-hot encoding.".format(len(encoded)))

- We have encoded the categorical features as dummy variables using one-hot encoding, which turns them into numerical data that the models can process.
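As an illustration of what one-hot encoding produces, the sketch below encodes a single categorical column by hand. The helper `one_hot` is hypothetical and only mimics the `column_value` naming style; it is not how `pd.get_dummies` is implemented.

```python
def one_hot(values, prefix):
    """Return one dict per row with a 0/1 indicator for every category."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{c}": int(v == c) for c in categories}
        for v in values
    ]


# Hypothetical sample of the workclass column
workclass = ["Private", "State-gov", "Private", "Unknown"]
encoded_rows = one_hot(workclass, "workclass")

for row in encoded_rows:
    print(row)
```

Each row has exactly one indicator set to 1, which is why a column with many categories (such as native_country) adds many new features.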

<a id='exp'></a>
# Experimental Process

In the project, the **independent variables** have been chosen as follows:
1. **Age**
2. **Workclass**
3. **Fnlwgt**
4. **Education_num**
5. **Marital_status**
6. **Occupation**
7. **Relationship**
8. **Race**
9. **Sex**
10. **Capital_gain**
11. **Capital_loss**
12. **Hours_per_week**
13. **Native_country**

Also, the **Income** variable is considered to be the **dependent variable**, since it is our concern in this experiment.

# **Task: Analyse the code and import the libraries required to run this program**

In [None]:
# import some classification models
#Task 1: Import libraries for Random forest classifier, AdaBoost Classifier, Logistic Regression Classifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression

# import needed functions
#Task 2: Import libraries for cross validation, accuracy score, F1 score and splitting dataset.
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate

import warnings
warnings.filterwarnings("ignore")

# **Task: Write code for splitting to training and testing. Training data = 80% and apply random split**

In [None]:
# Portioning the data
X = cens_prep.drop('income', axis=1)
y = cens_prep['income']

# Task-3: Write code for splitting to training and testing. Training data = 80% and apply random split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
models = {}

# models with default parameter
models['LR'] = LogisticRegression()  # LR: logistic regression
models['RandomForest'] = RandomForestClassifier()  # random forest
models['AdaBoost'] = AdaBoostClassifier()  # AdaBoost


# Task: Write a short description on:
1. Cross validation
2. F1 Score


In [None]:
# Cross validation
for model_name in models:   # a loop to choose the model defined in the above code
    model = models[model_name]
    results = cross_validate(model, X, y, cv=5, scoring=['accuracy', 'f1'], return_train_score=True)

    print(model_name + ":")
    print("Accuracy:" , 'train: ', results['train_accuracy'].mean(), '| test: ', results['test_accuracy'].mean())
    print("F1-score:" , 'train: ', results['train_f1'].mean(), '| test: ', results['test_f1'].mean())
    print("---------------------------------------------------------")

- The exploration of our dataset shows an imbalance between the classes: individuals making at most 50K represent about 75% of the data, while those making more than 50K are the minority. So, we will try oversampling.

In [None]:
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)

In [None]:
clf = RandomForestClassifier()

results = cross_validate(clf, X_resampled, y_resampled, cv=5, scoring=['accuracy', 'f1'], return_train_score=True)
print("Accuracy:" , 'train: ', results['train_accuracy'].mean(), '| test: ', results['test_accuracy'].mean())
print("F1-score:" , 'train: ', results['train_f1'].mean(), '| test: ', results['test_f1'].mean())

In [None]:
# AdaBoost classifier
ada = AdaBoostClassifier()
results = cross_validate(ada, X_resampled, y_resampled, cv=5, scoring=['accuracy', 'f1'], return_train_score=True)
print(f"Accuracy: {results['train_accuracy'].mean()}, test accuracy: {results['test_accuracy'].mean()}")
print(f"F1-score: {results['train_f1'].mean()}, test f1: {results['test_f1'].mean()}")


In [None]:
lgr = LogisticRegression()
results = cross_validate(lgr, X_resampled, y_resampled, cv=5, scoring=['accuracy', 'f1'], return_train_score=True)
print(f"Accuracy: {results['train_accuracy'].mean()}, test accuracy: {results['test_accuracy'].mean()}")
print(f"F1-score: {results['train_f1'].mean()}, test f1: {results['test_f1'].mean()}")

### Models Definitions:
**Logistic regression**, despite its name, is a linear model for classification rather than regression. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.
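A minimal sketch of the logistic function at the heart of this model is shown here. The weights, bias, and feature values are invented for illustration and are not fitted to the census data.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the positive class for one row of features."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


# Hypothetical parameters and a hypothetical (already scaled) row
weights = [2.0, -1.0]
bias = -0.5
p = predict_proba([1.0, 0.5], weights, bias)
label = int(p >= 0.5)  # threshold the probability to get a class
print(p, label)
```

The model is linear in the sense that the features are combined as a weighted sum; the logistic function only converts that sum into a probability.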

**A Random forest** is a meta estimator that fits several decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

**An AdaBoost classifier** is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.
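The reweighting idea can be sketched with one round of the classic discrete AdaBoost update. This is a textbook-style illustration under the assumptions that the sample weights sum to 1 and that the weak learner's weighted error lies strictly between 0 and 1; it is not a description of the exact algorithm `AdaBoostClassifier` runs internally, and the function name is hypothetical.

```python
import math


def adaboost_reweight(weights, is_correct):
    """One round of the classic discrete AdaBoost weight update."""
    # Weighted error: total weight of the misclassified samples
    err = sum(w for w, ok in zip(weights, is_correct) if not ok)
    # Vote strength of this weak learner (larger when err is small)
    alpha = 0.5 * math.log((1 - err) / err)
    # Shrink weights of correct samples, grow weights of wrong ones
    updated = [
        w * math.exp(-alpha if ok else alpha)
        for w, ok in zip(weights, is_correct)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Four samples with equal weight; the last one was misclassified
weights = [0.25, 0.25, 0.25, 0.25]
is_correct = [True, True, True, False]
alpha, new_weights = adaboost_reweight(weights, is_correct)
print(alpha, new_weights)
```

After the update the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.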

### Evaluation Methodology:
The data has been split into training and testing parts (features and label) with a test size of 20% and a fixed random state, so that subsequent runs produce the same split. This was done using the train_test_split function.

Cross-validation has been applied to compare the models and select the most suitable ones. We have done that using the cross_validate function with 5 folds, which outputs the train and test scores of each model.
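To show what a 5-fold split looks like, the sketch below generates train and test indices for plain, unshuffled k-fold. The helper `k_fold_indices` is hypothetical; for classifiers, scikit-learn's integer `cv` option uses a stratified variant that also preserves class proportions, which this sketch leaves out.

```python
def k_fold_indices(n_samples, k):
    """Return a list of (train_indices, test_indices) pairs for k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]
    folds = []
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        folds.append((train_idx, test_idx))
        start += size
    return folds


folds = k_fold_indices(10, 5)
for train_idx, test_idx in folds:
    print("train:", train_idx, "| test:", test_idx)
```

Every sample is used for testing exactly once and for training in the other four folds; the reported score is the mean over the five test folds.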

All of this has been done on the cleaned data, in its scaled and encoded version.

There is an imbalance between the classes. Fixing it would help the model learn better from both classes and avoid being biased towards one over the other.

One way to address this issue is to generate new samples in the under-represented (minority) class. The most naive strategy is to generate new samples by randomly sampling, with replacement, from the currently available samples. This is called oversampling, a technique used to turn unequal data classes into balanced datasets.

This has been applied using the RandomOverSampler class from the imblearn library to generate the resampled data.
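The naive strategy described above can be sketched without any library. The function `random_oversample` and the toy rows are hypothetical and only illustrate sampling with replacement; they do not reproduce the internals of `RandomOverSampler`.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=42):
    """Duplicate random minority rows until all classes are equally large."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        # Rows of this class that we are allowed to copy
        pool = [r for r, l in zip(rows, labels) if l == label]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))  # sampling with replacement
            out_labels.append(label)
    return out_rows, out_labels


# Toy data: four rows of class 0 and two rows of class 1
rows = [[0.1], [0.2], [0.3], [0.4], [0.9], [0.8]]
labels = [0, 0, 0, 0, 1, 1]
new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

One caveat worth keeping in mind: because the added rows are exact copies, oversampling before cross-validation can place the same row in both a training fold and a test fold, which may make the test scores look better than they would on unseen data.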

### Metrics used for Evaluation:

We have used the accuracy metric for the evaluation of the models. We can describe the accuracy metric as the ratio between the number of correct predictions and the total number of predictions:

$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$

For binary classification, accuracy can also be calculated in terms of the confusion matrix terminology:

$$\text{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN}$$

Where TP = True Positives, TN = True Negatives, FP = False Positives, and FN = False Negatives.
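The confusion matrix terms can be counted directly from the true and predicted labels, as in this small sketch. The label lists are made up for illustration, with 1 standing for >50K and 0 for <=50K.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical true and predicted labels
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, tn, fp, fn)
print(tp, tn, fp, fn, acc)
```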


Also, the F1-score has been used as one of the metrics in the experiment. It is the harmonic mean of precision and recall, where an F1 score reaches its best value at 1 and its worst at 0. Precision and recall contribute equally to the F1 score.

The formula for the F1 score is:

$$\text{F1} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$
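The sketch below computes the F1 score from confusion counts and contrasts it with accuracy on imbalanced data. The counts describe two hypothetical classifiers evaluated on 100 people, 25 of whom earn >50K; returning 0.0 when precision and recall are both zero is a convention chosen for this sketch.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0  # convention used here to avoid dividing by zero
    return 2 * precision * recall / (precision + recall)


# Classifier A always predicts <=50K: it is right for the 75 majority
# cases (75% accuracy) but never finds a single >50K individual.
f1_majority = f1_from_counts(tp=0, fp=0, fn=25)

# Classifier B finds 15 of the 25 >50K individuals with 5 false alarms.
f1_b = f1_from_counts(tp=15, fp=5, fn=10)

print(f1_majority, f1_b)
```

Classifier A reaches 75% accuracy while its F1 score is 0, which is why accuracy alone can be misleading when one class dominates.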

<a id='conclude'></a>
# Conclusions

## Features Importance

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
cens_conc = cens.copy()

In [None]:
for col in cens_conc.columns:
    if cens_conc[col].dtypes == 'object':
        encoder = LabelEncoder()
        cens_conc[col] = encoder.fit_transform(cens_conc[col])

In [None]:
# Portioning the data
Xc = cens_conc.drop('income', axis=1)
yc = cens_conc['income']

# Splitting to training and testing
Xc_train, Xc_test, yc_train, yc_test = train_test_split(Xc, yc, test_size=0.2, random_state=42)

In [None]:
rfc = RandomForestClassifier()
rfc.fit(Xc_train, yc_train)


# View a list of the features and their importance scores
print('\nFeatures Importance:')
feat_imp = pd.DataFrame(zip(Xc.columns.tolist(), rfc.feature_importances_ * 100), columns=['feature', 'importance'])
feat_imp

# Task: Why one should perform 'Feature Importance'?

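Feature importance measures how much each variable contributes to a machine learning model's predictions. Knowing this helps us understand what drives the predictions and identify low-importance features that can be dropped to simplify and speed up the model. Tree-based models such as random forests provide importance scores directly through the feature_importances_ attribute.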


In [None]:
# Features importance plot
plt.figure(figsize=(20,6))
sb.barplot(data=feat_imp, x='feature', y='importance')
plt.title('Features Importance', weight='bold', fontsize=20)
plt.xlabel('Feature', weight='bold', fontsize=13)
plt.ylabel('Importance (%)', weight='bold', fontsize=13);


# add annotations
impo = feat_imp['importance']
locs, labels = plt.xticks()

for loc, label in zip(locs, labels):
    count = impo[loc]
    pct_string = '{:0.2f}%'.format(count)

    plt.text(loc, count-0.8, pct_string, ha = 'center', color = 'w', weight='bold')

- We plan to drop the features that have less than 4% importance to speed up fitting the model, expecting the evaluation results to stay about the same without them.
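The thresholding step can be sketched as a simple filter over importance scores. The scores below are invented for illustration and are not the values produced by the random forest above; `select_features` is a hypothetical helper.

```python
def select_features(importances, threshold):
    """Split feature names into kept and dropped by importance (in %)."""
    kept = [name for name, score in importances.items() if score >= threshold]
    dropped = [name for name, score in importances.items() if score < threshold]
    return kept, dropped


# Made-up importance percentages, for illustration only
importances = {"age": 15.0, "race": 1.5, "sex": 2.0, "capital_gain": 11.0}
kept, dropped = select_features(importances, threshold=4.0)
print("keep:", kept)
print("drop:", dropped)
```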

## Feature Selection

In [None]:
cens_final = cens.copy()

In [None]:
cens_final.head(2)

In [None]:
cens_final.drop(['race', 'sex', 'capital_loss', 'native_country'], axis=1, inplace=True)

In [None]:
# Scaling
numerical = ['age', 'capital_gain', 'hours_per_week', 'fnlwgt']
scaler = MinMaxScaler()
cens_final[numerical] = scaler.fit_transform(cens_final[numerical])

# Encoding
cens_final['income'] = cens_final.income.replace({"<=50K": 0, ">50K": 1})

# Create dummy variables
cens_final = pd.get_dummies(cens_final)

# Portioning
Xf = cens_final.drop('income', axis=1)
yf = cens_final['income']

# Oversampling
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(Xf, yf)

In [None]:
rf2 = RandomForestClassifier()

results = cross_validate(rf2, X_resampled, y_resampled, cv=5, scoring=['accuracy', 'f1'], return_train_score=True)
print("Accuracy:" , 'train: ', results['train_accuracy'].mean(), '| test: ', results['test_accuracy'].mean())
print("F1-score:" , 'train: ', results['train_f1'].mean(), '| test: ', results['test_f1'].mean())

# Task: Analyse the code of the random forest classifier and write your code for the other two classifiers, AdaBoost and logistic regression.



In [None]:
# AdaBoost classifier on the reduced feature set
ada = AdaBoostClassifier()

results = cross_validate(ada, X_resampled, y_resampled, cv=5, scoring=['accuracy', 'f1'], return_train_score=True)
print("Accuracy:" , 'train: ', results['train_accuracy'].mean(), '| test: ', results['test_accuracy'].mean())
print("F1-score:" , 'train: ', results['train_f1'].mean(), '| test: ', results['test_f1'].mean())

In [None]:
lgr = LogisticRegression()
results = cross_validate(lgr, X_resampled, y_resampled, cv=5, scoring=['accuracy', 'f1'], return_train_score=True)
print(f"Accuracy: {results['train_accuracy'].mean()}, test accuracy: {results['test_accuracy'].mean()}")
print(f"F1-score: {results['train_f1'].mean()}, test f1: {results['test_f1'].mean()}")

# Challenging Task: Try to implement to solve this problem using SVM and KNN.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Models definition
knn = KNeighborsClassifier()
svm = SVC(gamma='auto')

# KNN model
results = cross_validate(knn, X_resampled, y_resampled, cv=5, scoring=['accuracy', 'f1'], return_train_score=True)
print('------------- KNN Classifier -----------------')
print(f"Accuracy: {results['train_accuracy'].mean()}, test accuracy: {results['test_accuracy'].mean()}")
print(f"F1-score: {results['train_f1'].mean()}, test f1: {results['test_f1'].mean()}")

# SVM model
results = cross_validate(svm, X_resampled, y_resampled, cv=5, scoring=['accuracy', 'f1'], return_train_score=True)
print('------------- SVM Classifier -----------------')
print(f"Accuracy: {results['train_accuracy'].mean()}, test accuracy: {results['test_accuracy'].mean()}")
print(f"F1-score: {results['train_f1'].mean()}, test f1: {results['test_f1'].mean()}")
