# Financial Product Recommendation - Part 1


## Table of Contents

1. [Introduction](#1)
2. [Problem Statement](#2)
3. [Libraries](#3)
4. [Data Set](#4)
      - [4.1 Data Dictionary ](#4.1)
      - [4.2 Loading Data ](#4.2)
      - [4.3 Data Set Info & Display ](#4.3)
5. [Data Cleaning](#5)
6. [Numerical Outliers Check](#6)
7. [Categorical Columns Cleaning](#7)
      - [7.1 Text Cleaning Process ](#7.1)
      - [7.2 Transforming Columns ](#7.2)
8. [Saving the Data for EDA and ML purposes](#8)


## 1. Introduction<a id="1"></a>

This project focuses on predicting an individual's income bracket from census data, using features such as age, workclass, education, occupation, and weekly working hours. The dataset comprises demographic and employment-related information. Our objective is to classify individuals into one of two income categories: __<=50K or >50K__. Using machine learning algorithms, we will explore how these variables influence income and develop a model that predicts whether a person earns more or less than $50,000 annually. The predicted bracket is then used to recommend financial products such as stocks, credit cards, and loans.

## 2. Problem Statement
<a id="2"></a>

The goal is to build a machine learning model that accurately predicts whether an individual earns more or less than $50,000 per year from the features listed above. The prediction can also provide insight into the socio-economic factors that influence income distribution.

## 3. Libraries Used
<a id="3"></a>



In [1]:
import numpy as np
import pandas as pd
import pandasql
import matplotlib 
import seaborn
import os



## 4. Dataset
<a id="4"></a>

In this project we source the Census dataset from the UCI Machine Learning Repository. It contains a mix of categorical columns (workclass, education level, marital status, and others) and numerical columns (capital gain, age, hours per week, and others).

## 4.1 Data Dictionary
<a id="4.1"></a>

<table style="border-collapse: collapse; width: 100%; font-family: Arial, sans-serif;">
  <thead>
    <tr style="background-color: #f2f2f2;">
      <th style="border: 1px solid #ddd; padding: 8px; font-weight: bold; text-align: left;">Feature</th>
      <th style="border: 1px solid #ddd; padding: 8px; font-weight: bold; text-align: left;">Type</th>
      <th style="border: 1px solid #ddd; padding: 8px; font-weight: bold; text-align: left;">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="border: 1px solid #ddd; padding: 8px;">age</td>
      <td style="border: 1px solid #ddd; padding: 8px;">Numerical</td>
      <td style="border: 1px solid #ddd; padding: 8px;">Age of the individual</td>
    </tr>
    <tr style="background-color: #f9f9f9;">
      <td style="border: 1px solid #ddd; padding: 8px;">workclass</td>
      <td style="border: 1px solid #ddd; padding: 8px;">Categorical</td>
      <td style="border: 1px solid #ddd; padding: 8px;">Type of employment (e.g., Private, Self-emp-not-inc)</td>
    </tr>
    <tr>
      <td style="border: 1px solid #ddd; padding: 8px;">education_level</td>
      <td style="border: 1px solid #ddd; padding: 8px;">Categorical</td>
      <td style="border: 1px solid #ddd; padding: 8px;">Highest level of education completed (e.g., Bachelors, HS-grad)</td>
    </tr>
    <tr style="background-color: #f9f9f9;">
      <td style="border: 1px solid #ddd; padding: 8px;">education-num</td>
      <td style="border: 1px solid #ddd; padding: 8px;">Numerical</td>
      <td style="border: 1px solid #ddd; padding: 8px;">Number of years of education completed</td>
    </tr>
    <tr>
      <td style="border: 1px solid #ddd; padding: 8px;">marital-status</td>
      <td style="border: 1px solid #ddd; padding: 8px;">Categorical</td>
      <td style="border: 1px solid #ddd; padding: 8px;">Marital status (e.g., Never-married, Divorced)</td>
    </tr>
    <tr style="background-color: #f9f9f9;">
      <td style="border: 1px solid #ddd; padding: 8px;">occupation</td>
      <td style="border: 1px solid #ddd; padding: 8px;">Categorical</td>
      <td style="border: 1px solid #ddd; padding: 8px;">Type of occupation (e.g., Exec-managerial, Handlers-cleaners)</td>
    </tr>
    <tr>
      <td style="border: 1px solid #ddd; padding: 8px;">relationship</td>
      <td style="border: 1px solid #ddd; padding: 8px;">Categorical</td>
      <td style="border: 1px solid #ddd; padding: 8px;">Family relationship (e.g., Husband, Wife)</td>
    </tr>
    <tr style="background-color: #f9f9f9;">
      <td style="border: 1px solid #ddd; padding: 8px;">race</td>
      <td style="border: 1px solid #ddd; padding: 8px;">Categorical</td>
      <td style="border: 1px solid #ddd; padding: 8px;">Race of the individual (e.g., White, Black)</td>
    </tr>
    <tr>
      <td style="border: 1px solid #ddd; padding: 8px;">sex</td>
      <td style="border: 1px solid #ddd; padding: 8px;">Categorical</td>
      <td style="border: 1px solid #ddd; padding: 8px;">Gender of the individual (Male, Female)</td>
    </tr>
    <tr style="background-color: #f9f9f9;">
      <td style="border: 1px solid #ddd; padding: 8px;">capital-gain</td>
      <td style="border: 1px solid #ddd; padding: 8px;">Numerical</td>
      <td style="border: 1px solid #ddd; padding: 8px;">Capital gains from investments</td>
    </tr>
    <tr>
      <td style="border: 1px solid #ddd; padding: 8px;">capital-loss</td>
      <td style="border: 1px solid #ddd; padding: 8px;">Numerical</td>
      <td style="border: 1px solid #ddd; padding: 8px;">Capital losses from investments</td>
    </tr>
    <tr style="background-color: #f9f9f9;">
      <td style="border: 1px solid #ddd; padding: 8px;">hours-per-week</td>
      <td style="border: 1px solid #ddd; padding: 8px;">Numerical</td>
      <td style="border: 1px solid #ddd; padding: 8px;">Number of hours worked per week</td>
    </tr>
    <tr>
      <td style="border: 1px solid #ddd; padding: 8px;">native-country</td>
      <td style="border: 1px solid #ddd; padding: 8px;">Categorical</td>
      <td style="border: 1px solid #ddd; padding: 8px;">Country of origin (e.g., United-States, Cuba)</td>
    </tr>
    <tr style="background-color: red;">
      <td style="border: 1px solid #ddd; padding: 8px;">income</td>
      <td style="border: 1px solid #ddd; padding: 8px;">Categorical</td>
      <td style="border: 1px solid #ddd; padding: 8px;">Income class (&lt;=50K, &gt;50K)</td>
    </tr>
  </tbody>
</table>


## 4.2 Loading Data
<a id="4.2"></a>

In [2]:
# Loading the Data
info = pd.read_csv('/Users/bhargavdevarapalli/Downloads/Brainstation_capstone_Census/data/census.csv')

## 4.3 Data Set Info & Display
<a id="4.3"></a>



In [3]:
print('Total Number of  Records : {} '.format(info.shape[0]))
print('Total Number of  Columns : {} '.format(info.shape[1]))
print('')
print('')
print('')
print('----------------Data Types-----------------')
print( info.dtypes )
print('')
print('')
print('')
info.head()

Total Number of  Records : 45222 
Total Number of  Columns : 14 



----------------Data Types-----------------
age                  int64
workclass           object
education_level     object
education-num      float64
marital-status      object
occupation          object
relationship        object
race                object
sex                 object
capital-gain       float64
capital-loss       float64
hours-per-week     float64
native-country      object
income              object
dtype: object





Unnamed: 0,age,workclass,education_level,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,United-States,<=50K
2,38,Private,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,United-States,<=50K
3,53,Private,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,United-States,<=50K
4,28,Private,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K


## 5. Data Cleaning
<a id="5"></a>

We check every column for null (missing) values.


In [4]:
# Counting missing values per column
info.isna().sum()

age                0
workclass          0
education_level    0
education-num      0
marital-status     0
occupation         0
relationship       0
race               0
sex                0
capital-gain       0
capital-loss       0
hours-per-week     0
native-country     0
income             0
dtype: int64
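
`isna()` only counts values that pandas already treats as missing. Text placeholders such as `?` would pass unnoticed, so a quick follow-up scan of the text columns can be useful. The sketch below is a minimal illustration: the helper name `count_placeholders`, the token list, and the sample frame are assumptions for demonstration and are not taken from `census.csv`.

```python
import pandas as pd

# Hypothetical placeholder tokens; adjust to whatever the raw file actually uses.
PLACEHOLDERS = ["?", "", "NA", "N/A", "UNKNOWN"]


def count_placeholders(df: pd.DataFrame) -> pd.Series:
    """Count placeholder-like strings in every non-numeric column."""
    text_cols = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
    counts = {}
    for col in text_cols:
        # Normalize whitespace and case before comparing against the token list
        normalized = df[col].astype(str).str.strip().str.upper()
        counts[col] = int(normalized.isin(PLACEHOLDERS).sum())
    return pd.Series(counts, dtype="int64")


# Small illustrative frame (made-up values)
sample = pd.DataFrame({
    "workclass": ["Private", " ?", "State-gov"],
    "occupation": ["Sales", "?", "?"],
    "age": [25, 40, 31],
})
placeholder_counts = count_placeholders(sample)
print(placeholder_counts)
```

On the real data this would be called as `count_placeholders(info)`; a result of all zeros would support the conclusion that no further missing-value handling is needed.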

## 6. Numerical Outliers Check
<a id="6"></a>

Below we use summary statistics to check the numerical columns for outliers.

In [5]:
info.describe()

Unnamed: 0,age,education-num,capital-gain,capital-loss,hours-per-week
count,45222.0,45222.0,45222.0,45222.0,45222.0
mean,38.547941,10.11846,1101.430344,88.595418,40.938017
std,13.21787,2.552881,7506.430084,404.956092,12.007508
min,17.0,1.0,0.0,0.0,1.0
25%,28.0,9.0,0.0,0.0,40.0
50%,37.0,10.0,0.0,0.0,40.0
75%,47.0,13.0,0.0,0.0,45.0
max,90.0,16.0,99999.0,4356.0,99.0


#####  [Findings]    Age, education-num, and hours-per-week look reasonably consistent, but capital-gain and capital-loss are heavily skewed: both have a median and 75th percentile of 0, while their maxima reach 99999 and 4356. These two columns need a closer look in the EDA part.
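
To put a number on this, one option is the common 1.5 × IQR rule, which counts values that fall outside the interquartile fences. The sketch below is an assumption about how such a check could look, not part of the original notebook; the helper name `iqr_outlier_counts` and the sample values are made up.

```python
import pandas as pd


def iqr_outlier_counts(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Compare each column against its own fences (Series aligned on column names)
    outside = numeric.lt(lower, axis="columns") | numeric.gt(upper, axis="columns")
    return outside.sum()


# Made-up values that mimic the shape of the real columns
sample = pd.DataFrame({
    "age": [20, 30, 40, 50, 60],
    "capital-gain": [0, 0, 0, 0, 5000],
})
outlier_counts = iqr_outlier_counts(sample)
print(outlier_counts)
```

Because the 75th percentile of capital-gain and capital-loss is 0 in the `describe()` output, their IQR is 0 and this rule flags every non-zero value. For those two columns the count is better read as "how many records have any gain or loss at all" than as a count of errors.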

## 7. Categorical Columns Cleaning
<a id="7"></a>

We will create a text-cleaning function and apply it to all the categorical columns to make sure the data is consistent throughout.




## 7.1 Text Cleaning Function
<a id="7.1"></a>

In [6]:
def textcleaning(text):
    r'''
    Normalize a categorical text value.

    - Removes newline characters (\n)
    - Removes carriage returns (\r)
    - Removes tabs (\t)
    - Removes spaces (' ')
    - Converts everything to uppercase
    '''
    return text.replace('\t', '').replace('\n', '').replace('\r', '').replace(' ', '').upper()
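
The same cleaning can also be applied to a whole column at once with pandas string methods instead of calling the function value by value. The sketch below is an alternative under that assumption, not the approach used in this notebook; `clean_text_columns` and the sample frame are hypothetical.

```python
import pandas as pd


def clean_text_columns(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Return a copy with spaces, tabs, \n and \r removed and text uppercased."""
    cleaned = df.copy()
    for col in columns:
        cleaned[col] = (
            cleaned[col]
            .str.replace(r"[ \t\n\r]", "", regex=True)  # same four characters as textcleaning
            .str.upper()
        )
    return cleaned


# Made-up values with the kinds of stray whitespace the function targets
sample = pd.DataFrame({
    "workclass": [" State-gov", "Private\n"],
    "income": [" <=50K", ">50K\r"],
})
cleaned_sample = clean_text_columns(sample, ["workclass", "income"])
print(cleaned_sample)
```

One practical difference: the `.str` methods pass missing values through unchanged, whereas calling `.replace` on a non-string value such as `NaN` raises an `AttributeError`. With no nulls in this dataset both approaches behave the same.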
    

    

## 7.2 Transforming Columns
<a id="7.2"></a>

    1) Taking a copy of the existing dataframe to make changes
    2) Running the text cleaning function on all the categorical columns
    3) Converting the income column to binary -  <=50K (0)    >50K (1)

In [7]:
# Copy the data into a new dataframe so the original stays untouched
info_transform = info.copy()

# Get all the categorical (object-typed) columns
col_list = list(info_transform.select_dtypes(include=['object']).columns)

# Clean each categorical column
for col in col_list:
    info_transform[col] = info_transform[col].apply(textcleaning)

# Convert income to binary values: '<=50K' -> 0, '>50K' -> 1
info_transform['income'] = (info_transform['income'] == '>50K').astype(int)
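
A comparison against `'>50K'` maps every other value to 0, including any unexpected label (for example one with a trailing period such as `>50K.`). A stricter variant can fail loudly instead. The sketch below is one possible way to do that; `INCOME_CODES`, `encode_income`, and the sample labels are assumptions for illustration.

```python
import pandas as pd

# Expected labels after text cleaning and their binary codes
INCOME_CODES = {"<=50K": 0, ">50K": 1}


def encode_income(labels: pd.Series) -> pd.Series:
    """Map income labels to 0/1 and raise if an unknown label appears."""
    unexpected = set(labels.unique()) - set(INCOME_CODES)
    if unexpected:
        raise ValueError(f"Unexpected income labels: {sorted(map(str, unexpected))}")
    return labels.map(INCOME_CODES).astype(int)


# Made-up labels
encoded = encode_income(pd.Series(["<=50K", ">50K", "<=50K"]))
print(encoded.tolist())

# An unknown label is rejected rather than silently encoded as 0
try:
    encode_income(pd.Series(["<=50K", ">50K."]))
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

A quick `info_transform['income'].value_counts()` after the conversion serves a similar purpose: it shows whether both classes are present in plausible proportions.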


    


## 8. Saving the Data for EDA and ML Purposes
<a id="8"></a>

     1) Delete the existing csv file named census_modified.csv, if it is present
     
     2) Save the transformed dataframe to a 'census_modified.csv' file for the EDA and ML sections
     

In [8]:
import os

#1------------------------------

# file path
file_path = '/Users/bhargavdevarapalli/Downloads/Brainstation_capstone_Census/data/census_modified.csv'

# Checking if it is present before Deleting it

if os.path.exists(file_path):
    os.remove(file_path)
    print(f"{file_path} has been deleted.")
    
else:
    print(f"{file_path} does not exist.")


#2------------------------------
# Reuse file_path so the deleted file and the saved file are guaranteed to be the same
info_transform.to_csv(file_path)
print('     ')
print('New File is located in the Data Folder :')
print('     ')
print(os.listdir(os.path.dirname(file_path)))

/Users/bhargavdevarapalli/Downloads/Brainstation_capstone_Census/data/census_modified.csv has been deleted.
     
New File is located in the Data Folder :
     
['my_data.csv', 'census_modified.csv', 'census.csv', 'data_links.md']
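
Note for the EDA and ML notebooks: `to_csv` writes the dataframe index as an unnamed first column by default, so a plain `read_csv` brings it back as an extra `Unnamed: 0` column. The sketch below shows the effect on a tiny made-up frame written to a temporary folder; the frame and the temporary path are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

# Tiny made-up frame standing in for info_transform
frame = pd.DataFrame({"age": [39, 50], "income": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "census_modified.csv")
    frame.to_csv(out_path)                         # index is written as the first column
    naive = pd.read_csv(out_path)                  # index returns as 'Unnamed: 0'
    restored = pd.read_csv(out_path, index_col=0)  # first column is used as the index again

print(list(naive.columns))     # ['Unnamed: 0', 'age', 'income']
print(list(restored.columns))  # ['age', 'income']
```

Reading the file with `index_col=0`, or saving it with `index=False`, keeps the column count at 14 in the downstream notebooks.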
