
## Predict A Person's Salary



### Data Dictionary

```
- Age: continuous.

- Workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.

- Final_weight: continuous.

- Education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.

- Education_num: continuous.

- Marital_status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.

- Occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.

- Relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.

- Race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.

- Sex: Female, Male.

- Capital_gain: continuous.

- Capital_loss: continuous.

- Hours_per_week: continuous.

- Country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

- Salary: >50K, <=50K.
```

### Objective

```
Predict whether a person makes over 50K a year.

```

In [None]:
# Built-in library
import itertools

# Standard imports
import numpy as np
import pandas as pd


# pandas settings
pd.options.display.max_rows = 1_000
pd.options.display.max_columns = 1_000
pd.options.display.max_colwidth = 1_000


# Black code formatter
%load_ext lab_black

In [None]:
# Load the dataset
salary = pd.read_csv("salary.csv")

# Check the first few rows of the dataset
salary.head()

Unnamed: 0,Age,Workclass,Final_weight,Education,Education_num,Marital_status,Occupation,Relationship,Race,Sex,Capital_gain,Capital_loss,Hours_per_week,Country,Salary
0,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
1,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
2,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
3,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
4,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K


### Exploring data

In [None]:
# Check the shape of the dataset
salary.shape

(32560, 15)

In [None]:
# Check for missing values
salary.isnull().sum()

Age               0
Workclass         0
Final_weight      0
Education         0
Education_num     0
Marital_status    0
Occupation        0
Relationship      0
Race              0
Sex               0
Capital_gain      0
Capital_loss      0
Hours_per_week    0
Country           0
Salary            0
dtype: int64

In [None]:
salary.Country.value_counts()

 United-States                 29169
 Mexico                          643
 ?                               583
 Philippines                     198
 Germany                         137
 Canada                          121
 Puerto-Rico                     114
 El-Salvador                     106
 India                           100
 Cuba                             95
 England                          90
 Jamaica                          81
 South                            80
 China                            75
 Italy                            73
 Dominican-Republic               70
 Vietnam                          67
 Guatemala                        64
 Japan                            62
 Poland                           60
 Columbia                         59
 Taiwan                           51
 Haiti                            44
 Iran                             43
 Portugal                         37
 Nicaragua                        34
 Peru                             31
 

In [None]:
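# Drop rows whose country is unknown (" ?"); copy so that later column
# assignments modify an independent DataFrame rather than a filtered view
salary = salary.query("Country != ' ?'").copy()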
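
The filter has to match `' ?'` with a leading space because every string value in the file keeps the blank that follows each comma. As a minimal sketch (the inline CSV is made up, and it is only an assumption that `salary.csv` follows the same layout), `pd.read_csv` can strip that blank and turn the `?` placeholder into a real missing value at load time, which also makes `isnull()` report it:

```python
import io

import pandas as pd

# Tiny made-up sample mimicking a ", " separated file with "?" placeholders
raw = io.StringIO(
    "Age,Country,Salary\n"
    "50, United-States, <=50K\n"
    "38, ?, <=50K\n"
    "28, Cuba, >50K\n"
)

# skipinitialspace drops the blank after each comma;
# na_values converts the "?" placeholder into NaN
sample = pd.read_csv(raw, skipinitialspace=True, na_values="?")

# Unknown countries are now ordinary missing values
cleaned = sample.dropna(subset=["Country"]).copy()
print(cleaned["Country"].tolist())
```

With this approach the category labels no longer carry a leading space, so any later comparison (for example against `">50K"`) would have to drop the space as well.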

In [None]:
# Check the distribution of the target variable
salary["Salary"].value_counts()

 <=50K    24282
 >50K      7695
Name: Salary, dtype: int64
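
The two classes are imbalanced, which matters when reading accuracy later on. The short calculation below uses only the counts printed above to show what a model that always predicts the majority class would score on the filtered dataset:

```python
# Class counts copied from the value_counts() output above
counts = {"<=50K": 24282, ">50K": 7695}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = counts[majority_label] / total

print(f"Rows after filtering: {total}")
print(f"Always predicting '{majority_label}' scores {baseline_accuracy:.1%}")
```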

### Preprocessing data

In [None]:
from sklearn.preprocessing import LabelEncoder

cat_vars = [
    "Workclass",
    "Education",
    "Marital_status",
    "Occupation",
    "Relationship",
    "Race",
    "Sex",
    "Country",
]
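# Create a LabelEncoder object
le = LabelEncoder()

# Apply label encoding to each categorical column using the apply() method.
# The same encoder is refit for every column, so afterwards le.classes_
# only holds the labels of the last column ("Country").
salary[cat_vars] = salary[cat_vars].apply(lambda x: le.fit_transform(x))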
salary.head()

Unnamed: 0,Age,Workclass,Final_weight,Education,Education_num,Marital_status,Occupation,Relationship,Race,Sex,Capital_gain,Capital_loss,Hours_per_week,Country,Salary
0,50,6,83311,9,13,2,4,0,4,1,0,0,13,38,<=50K
1,38,4,215646,11,9,0,6,1,4,1,0,0,40,38,<=50K
2,53,4,234721,1,7,2,6,0,2,1,0,0,40,38,<=50K
3,28,4,338409,9,13,2,10,5,2,0,0,0,40,4,<=50K
4,37,4,284582,12,14,2,4,5,4,0,0,0,40,38,<=50K
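
The integer codes above are only useful if the label-to-code mapping can be looked up again, for example when encoding a new person. One possible alternative, sketched here on made-up rows rather than on the notebook's data, is scikit-learn's `OrdinalEncoder`, which keeps one category list per column and can invert the encoding:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up rows; the leading spaces mirror the raw values in the dataset
toy = pd.DataFrame(
    {
        "Workclass": [" Private", " State-gov", " Private"],
        "Sex": [" Male", " Female", " Male"],
    }
)

encoder = OrdinalEncoder()
encoded = encoder.fit_transform(toy)

# categories_ holds one sorted array of labels per column;
# a label's position in that array is its integer code
mapping = {col: list(cats) for col, cats in zip(toy.columns, encoder.categories_)}
print(mapping)

# The codes can be converted back into the original labels
decoded = encoder.inverse_transform(encoded)
print(decoded)
```

Note that integer codes impose an arbitrary order on the categories, which a linear model such as logistic regression will treat as meaningful.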


In [None]:
# Scale the numerical variables
from sklearn.preprocessing import StandardScaler

num_vars = [
    "Age",
    "Final_weight",
    "Education_num",
    "Capital_gain",
    "Capital_loss",
    "Hours_per_week",
]

scaler = StandardScaler()
salary[num_vars] = scaler.fit_transform(salary[num_vars])
salary.head()

Unnamed: 0,Age,Workclass,Final_weight,Education,Education_num,Marital_status,Occupation,Relationship,Race,Sex,Capital_gain,Capital_loss,Hours_per_week,Country,Salary
0,0.835962,6,-1.00646,9,1.143809,2,4,0,4,1,-0.145826,-0.215994,-2.220918,38,<=50K
1,-0.042381,4,0.245246,11,-0.418316,0,6,1,4,1,-0.145826,-0.215994,-0.033848,38,<=50K
2,1.055548,4,0.42567,1,-1.199378,2,6,0,2,1,-0.145826,-0.215994,-0.033848,38,<=50K
3,-0.774334,4,1.406415,9,1.143809,2,10,5,2,0,-0.145826,-0.215994,-0.033848,4,<=50K
4,-0.115576,4,0.897286,12,1.534341,2,4,5,4,0,-0.145826,-0.215994,-0.033848,38,<=50K


### Train model

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into features (X) and target (y)
X = salary.drop(columns=["Salary"])
y = salary["Salary"]

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
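
In this notebook the scaler is fitted on the full dataset before the split, so the test rows contribute to the scaling statistics. A common way to avoid that is to split first and wrap the preprocessing and the model in a `Pipeline`. The sketch below is an assumption-laden illustration, not the notebook's method: it uses synthetic data, only three columns, and swaps label encoding for one-hot encoding.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the salary data (all values are made up)
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame(
    {
        "Age": rng.integers(18, 70, size=n),
        "Hours_per_week": rng.integers(10, 60, size=n),
        "Sex": rng.choice([" Male", " Female"], size=n),
    }
)
toy["Salary"] = np.where(toy["Hours_per_week"] > 40, " >50K", " <=50K")

X_toy = toy.drop(columns=["Salary"])
y_toy = toy["Salary"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Scale numeric columns and one-hot encode the categorical column
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["Age", "Hours_per_week"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
    ]
)
pipeline = Pipeline(
    [("preprocess", preprocess), ("clf", LogisticRegression(max_iter=1000))]
)

# fit() learns the scaling statistics from the training rows only
pipeline.fit(X_tr, y_tr)
test_accuracy = pipeline.score(X_te, y_te)
print(f"Toy test accuracy: {test_accuracy:.3f}")
```

Because the fitted pipeline carries its own preprocessing, `pipeline.predict` can be called directly on raw, unscaled rows.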

In [None]:
from sklearn.linear_model import LogisticRegression

# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()
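
The warning above means the default `lbfgs` solver hit its cap of 100 iterations before converging. As the message suggests, one option is to raise `max_iter`. The sketch below shows the idea on synthetic data generated with `make_classification` (a stand-in, not the salary data); the notebook's model was not refit this way, so the scores reported in the next section still come from the original fit.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

# Allow up to 1000 solver iterations instead of the default 100
demo_model = LogisticRegression(max_iter=1000)
demo_model.fit(X_demo, y_demo)

# n_iter_ reports how many iterations the solver actually used
iterations_used = int(demo_model.n_iter_[0])
print(f"Iterations used: {iterations_used} of {demo_model.max_iter} allowed")
```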

### Evaluate Model

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Make predictions on the test set
predict = model.predict(X_test)

# Calculate the accuracy
print("Accuracy: ", accuracy_score(y_test, predict))

Accuracy:  0.8256722951844903


In [None]:
# Calculate the precision
print("Precision: ", precision_score(y_test, predict, pos_label=" >50K"))

Precision:  0.6830031282586028


In [None]:
# Calculate the recall score
print("Recall: ", recall_score(y_test, predict, pos_label=" >50K"))

# Calculate the f1 score
print("F1-Score: ", f1_score(y_test, predict, pos_label=" >50K"))

Recall:  0.44679399727148705
F1-Score:  0.5402061855670103


The accuracy, precision, recall, and F1-score together indicate that the model performs moderately well. Its accuracy is 82.57%, meaning it predicts the correct class label for 82.57% of the instances in the test set. For context, about 76% of the rows in the filtered dataset belong to the `<=50K` class, so a model that always predicted `<=50K` would already reach roughly that accuracy.

The precision is 68.30%: of all the instances the model predicted as positive (`>50K`), 68.30% are actually positive.

The recall is 44.68%: of all the actual positive instances in the test set, the model correctly identifies 44.68%.

The F1-score is 54.02%. It is the harmonic mean of precision and recall, so it balances the two.

Overall, the model is only a modest improvement over the majority-class baseline, and its low recall shows that it misses more than half of the people who earn over 50K. There is clear room for improvement, for example by resolving the convergence warning raised during training.
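
All four scores follow from a single confusion matrix. The counts below are not printed anywhere in the notebook: they are the integer counts that appear consistent with the reported scores for a test set of 6,396 rows (20% of 31,977, rounded up), so treat them as a reconstruction rather than an output.

```python
# Confusion-matrix counts inferred from the reported scores (an assumption,
# not an output of the notebook); ">50K" is the positive class
tp, fp, fn, tn = 655, 304, 811, 4626

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of correct predictions
precision = tp / (tp + fp)  # predicted positives that are truly positive
recall = tp / (tp + fn)  # actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")
```

Read this way, the model finds 655 of the 1,466 high earners in the test set and misses the other 811.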

### Predict Salary

In [None]:
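# Predict the salary of a person
# Column names and order must match the training features exactly.
# Note: the model was trained on standardized numeric columns, while Age,
# Capital_gain, Capital_loss and Hours_per_week appear here in raw units.
# They should be passed through the fitted scaler before the prediction
# is trusted.
new_candidate = pd.DataFrame(
    {
        "Age": [30],
        "Workclass": [3],
        "Final_weight": [0.360141],
        "Education": [5],
        "Education_num": [1.134779],
        "Marital_status": [6],
        "Occupation": [10],
        "Relationship": [0],
        "Race": [2],
        "Sex": [0],
        "Capital_gain": [0],
        "Capital_loss": [0],
        "Hours_per_week": [40],
        "Country": [39],
    }
)
# predict() returns an array with one label per row
new_salary = model.predict(new_candidate)
if new_salary[0] == " >50K":
    print("The person makes over 50K a year")
else:
    print("The person doesn't make over 50K a year")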

The person makes over 50K a year
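
A prediction for a new person is only meaningful if the inputs go through the same preprocessing as the training data. The sketch below illustrates that step for the numeric columns alone, using made-up training rows and a hypothetical `demo_scaler`; in the notebook the equivalent would be reusing the already fitted `scaler` on all six numeric columns.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

num_cols = ["Age", "Hours_per_week"]

# Made-up training rows in raw units
train = pd.DataFrame(
    {"Age": [20, 30, 40, 50], "Hours_per_week": [20, 40, 40, 60]}
)

# Fit the scaler once, on the training rows
demo_scaler = StandardScaler().fit(train[num_cols])

# A new person described in raw units
candidate = pd.DataFrame({"Age": [30], "Hours_per_week": [40]})

# transform() (not fit_transform()) reuses the training mean and std
candidate_scaled = pd.DataFrame(
    demo_scaler.transform(candidate[num_cols]), columns=num_cols
)
print(candidate_scaled)
```

Here an age of 30 becomes a negative value because it is below the training mean of 35, while 40 hours maps to 0 because it equals the training mean.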
