
# **Project 2 Part 1**



In [1]:
# Pandas
import pandas as pd
# Numpy
import numpy as np
# MatplotLib
import matplotlib.pyplot as plt

# Preprocessing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.compose import make_column_selector
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer

# Models
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression, LinearRegression

# Classification Metrics

from sklearn.metrics import (roc_auc_score, ConfusionMatrixDisplay,
                             PrecisionRecallDisplay, RocCurveDisplay,
                             f1_score, accuracy_score, precision_score,
                             recall_score, classification_report)

# Set global scikit-learn configuration
from sklearn import set_config
# Display estimators as a diagram
set_config(display='diagram')  # 'text' or 'diagram'

# **Load and inspect the data**

# **First choice: dataset 1**

# **Stroke Prediction Dataset**





**Brief description of the dataset:**

- The dataset is about stroke prediction, which is a serious health issue that affects millions of people worldwide.

- The dataset contains 12 variables for each patient, including gender, age, hypertension, heart disease, ever married, work type, residence type, average glucose level, body mass index, smoking status, and stroke outcome.

- The dataset can be used to train a machine learning model to predict whether a patient is likely to get a stroke based on the input parameters.

- The dataset is from a confidential source and should be used only for educational purposes with proper citation.

**Suggestions for the models appropriate for the dataset:**

For the stroke prediction dataset, a classification model is suitable because the target variable is binary (stroke or no stroke). Possible classification models include logistic regression, decision tree, random forest, support vector machine, k-nearest neighbors, and neural network.

In [2]:
# Load the data
#f_path ="/content/drive/MyDrive/healthcare-dataset-stroke-data.csv"
#df = pd.read_csv(f_path)
#df.head()

# **Second choice: dataset 2**

# **Adult income dataset**

**Brief description of the dataset:**

- The dataset is about adult income, which is influenced by factors such as education level, age, gender, and occupation.

- The dataset contains information about 14 variables for each individual, such as age, work class, education, marital status, occupation, relationship, race, sex, capital gain, capital loss, hours per week, native country and income level.

- The dataset can be used to train a machine learning model to predict whether an individual’s income is above or below 50K based on the input parameters.

- The dataset is from the UCI machine learning repository and has been widely cited in the literature.

**Suggestions for the models appropriate for the dataset:**

For the adult income dataset, a classification model is also suitable because the target variable is binary (income above or below 50K). Possible classification models include logistic regression, decision tree, random forest, support vector machine, k-nearest neighbors, and neural network. Naive Bayes, gradient boosting, and XGBoost are further options.
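One way to choose among candidate classifiers like these is to compare them with cross-validation. The following is a minimal sketch, not the notebook's actual modeling step: it assumes synthetic data from `make_classification` as a stand-in for a binary target, and the hyperparameters (`max_depth=5`, `n_neighbors=5`) are arbitrary placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a binary target such as income (<=50K vs >50K)
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# Three of the candidate model families, with placeholder settings
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

# Mean 5-fold cross-validated accuracy for each candidate
cv_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
```

On the real data, accuracy alone may be misleading if the two income classes are imbalanced, so metrics such as F1 or ROC AUC (already imported above) are worth reporting alongside it.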

In [4]:
# Load the data
f_path2 ="/content/drive/MyDrive/Adult Income.csv"
df2 = pd.read_csv(f_path2)
df2.head()

| | age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | >50K |
| 4 | 18 | ? | 103497 | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |


In [7]:
# Display the number of rows and columns for the dataframe
#df.shape
#print(f'There are {df.shape[0]} rows, and {df.shape[1]} columns.')
#print(f'The rows represent {df.shape[0]} observations, and the columns represent {df.shape[1]-1} features and 1 target variable.')

In [8]:
# Display the number of rows and columns for the dataframe
df2.shape
print(f'There are {df2.shape[0]} rows, and {df2.shape[1]} columns.')
print(f'The rows represent {df2.shape[0]} observations, and the columns represent {df2.shape[1]-1} features and 1 target variable.')

There are 48842 rows, and 15 columns.
The rows represent 48842 observations, and the columns represent 14 features and 1 target variable.


In [9]:
# Display the column names, count of non-null values, and their datatypes
#df.info()

In [10]:
# Display the column names, count of non-null values, and their datatypes
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              48842 non-null  int64 
 1   workclass        48842 non-null  object
 2   fnlwgt           48842 non-null  int64 
 3   education        48842 non-null  object
 4   educational-num  48842 non-null  int64 
 5   marital-status   48842 non-null  object
 6   occupation       48842 non-null  object
 7   relationship     48842 non-null  object
 8   race             48842 non-null  object
 9   gender           48842 non-null  object
 10  capital-gain     48842 non-null  int64 
 11  capital-loss     48842 non-null  int64 
 12  hours-per-week   48842 non-null  int64 
 13  native-country   48842 non-null  object
 14  income           48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


# **Clean the Data**

**Remove Unnecessary Columns**

In [11]:
#df.drop(columns=['id'], inplace=True)
#df.columns

In [12]:
df2.drop(columns=['fnlwgt'], inplace=True)
df2.columns

Index(['age', 'workclass', 'education', 'educational-num', 'marital-status',
       'occupation', 'relationship', 'race', 'gender', 'capital-gain',
       'capital-loss', 'hours-per-week', 'native-country', 'income'],
      dtype='object')

**Duplicates**

In [13]:
# Display the number of duplicate rows in the dataset
#print(f'There are {df.duplicated().sum()} duplicate rows.')

In [14]:
# Display the number of duplicate rows in the dataset
#print(f'There are {df2.duplicated().sum()} duplicate rows.')

In [15]:
# Drop duplicate rows
df2.drop_duplicates(inplace=True)

In [16]:
# Display the number of duplicate rows in the dataset
print(f'There are {df2.duplicated().sum()} duplicate rows.')

There are 0 duplicate rows.
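After `drop_duplicates`, the surviving rows keep their original index labels, which is why a later `df2.info()` can report fewer entries than the index range suggests. The sketch below illustrates this on a made-up toy frame (the rows are illustrative, not taken from the dataset), and shows `reset_index(drop=True)` as an optional way to renumber the rows.

```python
import pandas as pd

# Toy frame (made-up rows) containing one exact duplicate
toy = pd.DataFrame({
    "age": [25, 38, 25, 44],
    "workclass": ["Private", "Private", "Private", "Local-gov"],
})

n_dupes = toy.duplicated().sum()  # rows identical to an earlier row
deduped = toy.drop_duplicates()   # keeps the first occurrence by default

# The original labels survive, so the index now has a gap
print(n_dupes, list(deduped.index))

# Optional: renumber the rows 0..n-1 after dropping
deduped_reset = deduped.reset_index(drop=True)
print(list(deduped_reset.index))
```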


**Missing Values**

In [17]:
# Display the total number of missing values
#print(f'There are {df.isna().sum().sum()} missing values.')

In [18]:
# Display the total number of missing values
print(f'There are {df2.isna().sum().sum()} missing values.')

There are 0 missing values.


In [19]:
# Display the number of missing values for each feature
#df.isna().sum()

Since `isna()` reports no missing values, we will not need SimpleImputer in our preprocessing steps.
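One caveat: the `df2.head()` output shows `?` entries in `workclass` and `occupation`, and `isna()` does not count a string placeholder as missing. The sketch below, on toy rows with illustrative values, shows how such placeholders could be converted to `NaN` so they become visible. Whether to treat `?` as missing or as its own category is a modeling choice the notebook has not made, so this is only one possible approach.

```python
import numpy as np
import pandas as pd

# Toy rows modeled on the df2.head() output; values are illustrative only
toy = pd.DataFrame({
    "workclass": ["Private", "?", "Local-gov", "?"],
    "occupation": ["Farming-fishing", "?", "Protective-serv", "Sales"],
    "age": [38, 18, 28, 40],
})

# isna() does not treat the string "?" as missing
missing_before = int(toy.isna().sum().sum())

# Convert the placeholder to a real missing value
toy_clean = toy.replace("?", np.nan)
missing_after = toy_clean.isna().sum()

print(missing_before)
print(missing_after)
```

If the placeholders were converted this way on `df2`, an imputation step (for example `SimpleImputer` with a most-frequent or constant strategy) would become relevant again.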

In [20]:
# Check for data types for each column
#df.info()

In [21]:
# Check for data types for each column
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 42468 entries, 0 to 48841
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              42468 non-null  int64 
 1   workclass        42468 non-null  object
 2   education        42468 non-null  object
 3   educational-num  42468 non-null  int64 
 4   marital-status   42468 non-null  object
 5   occupation       42468 non-null  object
 6   relationship     42468 non-null  object
 7   race             42468 non-null  object
 8   gender           42468 non-null  object
 9   capital-gain     42468 non-null  int64 
 10  capital-loss     42468 non-null  int64 
 11  hours-per-week   42468 non-null  int64 
 12  native-country   42468 non-null  object
 13  income           42468 non-null  object
dtypes: int64(5), object(9)
memory usage: 4.9+ MB


# **Fixing inconsistent values**

In [23]:
#data_types = df.dtypes
#str_cols = data_types[data_types=='object'].index
#str_cols

In [24]:
#for col in str_cols:
#    print(f'- {col}:')
#    print(df[col].value_counts(dropna=False))
#    print("\n\n")

In [25]:
data_types = df2.dtypes
str_cols = data_types[data_types=='object'].index
str_cols

Index(['workclass', 'education', 'marital-status', 'occupation',
       'relationship', 'race', 'gender', 'native-country', 'income'],
      dtype='object')

In [26]:
for col in str_cols:
    print(f'- {col}:')
    print(df2[col].value_counts(dropna=False))
    print("\n\n")

- workclass:
Private             28312
Self-emp-not-inc     3735
Local-gov            3011
?                    2411
State-gov            1927
Self-emp-inc         1644
Federal-gov          1397
Without-pay            21
Never-worked           10
Name: workclass, dtype: int64



- education:
HS-grad         12919
Some-college     9188
Bachelors        6967
Masters          2499
Assoc-voc        1961
11th             1598
Assoc-acdm       1563
10th             1277
7th-8th           931
Prof-school       813
9th               737
12th              618
Doctorate         576
5th-6th           498
1st-4th           242
Preschool          81
Name: education, dtype: int64



- marital-status:
Married-civ-spouse       19215
Never-married            13360
Divorced                  6218
Separated                 1512
Widowed                   1499
Married-spouse-absent      627
Married-AF-spouse           37
Name: marital-status, dtype: int64



- occupation:
Prof-specialty       5679
Exec-mana

In [27]:
# Check summary statistics
df2.describe()

| | age | educational-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|
| count | 42468.0 | 42468.0 | 42468.0 | 42468.0 | 42468.0 |
| mean | 39.476947 | 10.094801 | 1226.217128 | 99.859212 | 40.650702 |
| std | 13.779595 | 2.658658 | 7931.500736 | 429.072095 | 12.86796 |
| min | 17.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 29.0 | 9.0 | 0.0 | 0.0 | 38.0 |
| 50% | 38.0 | 10.0 | 0.0 | 0.0 | 40.0 |
| 75% | 49.0 | 13.0 | 0.0 | 0.0 | 45.0 |
| max | 90.0 | 16.0 | 99999.0 | 4356.0 | 99.0 |
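In the summary above, `capital-gain` and `capital-loss` are 0 at the 25th, 50th, and 75th percentiles while their maxima are large, which suggests heavily right-skewed, mostly-zero columns. A quick way to quantify that is sketched below on toy values (illustrative only, not the real column). The same two lines could be applied to `df2['capital-gain']`.

```python
import pandas as pd

# Toy values (illustrative only) mimicking a mostly-zero column
toy = pd.DataFrame({"capital-gain": [0, 0, 0, 7688, 0, 0, 99999, 0]})

zero_share = (toy["capital-gain"] == 0).mean()  # fraction of zero entries
skewness = toy["capital-gain"].skew()           # positive means a long right tail

print(f"share of zeros: {zero_share:.2f}, skew: {skewness:.2f}")
```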


I will move forward with df2, the Adult Income dataset.
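As a possible next step with df2, the preprocessing tools imported at the top could be combined roughly as follows. This is a sketch under assumptions, not the notebook's final pipeline: the toy rows reuse a few df2 column names with illustrative values, `OneHotEncoder` is swapped in for the imported `OrdinalEncoder` because these categories have no natural order, and the model is fit on all rows only to keep the example short.

```python
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows using a few df2 column names; the values are illustrative only
toy = pd.DataFrame({
    "age": [25, 38, 28, 44, 18, 52],
    "hours-per-week": [40, 50, 40, 40, 30, 45],
    "workclass": ["Private", "Private", "Local-gov", "Private", "?", "Self-emp-inc"],
    "income": ["<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K"],
})

X = toy.drop(columns="income")
y = (toy["income"] == ">50K").astype(int)  # 1 = income above 50K

# Scale numeric columns, one-hot encode everything else
preprocessor = make_column_transformer(
    (StandardScaler(), make_column_selector(dtype_include="number")),
    (OneHotEncoder(handle_unknown="ignore"), make_column_selector(dtype_exclude="number")),
)

# Chain preprocessing and the classifier so both are fit together
model = make_pipeline(preprocessor, LogisticRegression(max_iter=1000))
model.fit(X, y)
preds = model.predict(X)
print(preds)
```

On the real data, the pipeline would be fit on the training portion from `train_test_split` and evaluated on the held-out portion, so that scaling and encoding are learned from training rows only.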