# Machine Learning Project

Author: Nian Vrey

# Project Overview

This project works with several datasets, building a model for each one that predicts an outcome from the features in the data.

## Setup

In [6]:
# Imports
import pandas as pd

In [7]:
# Configurations
pd.set_option('display.max_columns', None)

# Adult Income

## Data Dictionary

age: continuous.

workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov,
State-gov, Without-pay, Never-worked.

fnlwgt: continuous.

education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm,
Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th,
Preschool.

education-num: continuous.

marital-status: Married-civ-spouse, Divorced, Never-married, Separated,
Widowed, Married-spouse-absent, Married-AF-spouse.

occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial,
Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical,
Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv,
Armed-Forces.

relationship: Wife, Own-child, Husband, Not-in-family, Other-relative,
Unmarried.

race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.

sex: Female, Male.

capital-gain: continuous.

capital-loss: continuous.

hours-per-week: continuous.

native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany,
Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran,
Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal,
Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia,
Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador,
Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

class: >50K, <=50K
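The CSV loaded later in this notebook names three of these fields differently from the dictionary: `educational-num`, `gender`, and `income` instead of `education-num`, `sex`, and `class`. A minimal sketch of aligning the two, assuming the dictionary names are the ones to keep (the mapping name `RENAME_TO_DICTIONARY` and the tiny sample frame are illustrative, not part of the original notebook):

```python
import pandas as pd

# Hypothetical mapping: CSV column name -> data dictionary name.
RENAME_TO_DICTIONARY = {
    "educational-num": "education-num",
    "gender": "sex",
    "income": "class",
}

# Tiny stand-in for df_adult_income, using values from the first two rows.
sample = pd.DataFrame({
    "educational-num": [7, 9],
    "gender": ["Male", "Male"],
    "income": ["<=50K", "<=50K"],
})

# rename() returns a new DataFrame; the original is left untouched.
renamed = sample.rename(columns=RENAME_TO_DICTIONARY)
print(list(renamed.columns))
```

Renaming in the other direction (keeping the CSV names) works just as well; what matters is that the dictionary and the DataFrame agree.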

### Uploading the data

In [8]:
# Read csv into DataFrame
fpath_adult_income = 'https://raw.githubusercontent.com/YoungVoid/Machine-Learning/main/Adult%20Income%20Dataset/adult.csv'
df_adult_income = pd.read_csv(fpath_adult_income)

In [10]:
df_adult_income.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              48842 non-null  int64 
 1   workclass        48842 non-null  object
 2   fnlwgt           48842 non-null  int64 
 3   education        48842 non-null  object
 4   educational-num  48842 non-null  int64 
 5   marital-status   48842 non-null  object
 6   occupation       48842 non-null  object
 7   relationship     48842 non-null  object
 8   race             48842 non-null  object
 9   gender           48842 non-null  object
 10  capital-gain     48842 non-null  int64 
 11  capital-loss     48842 non-null  int64 
 12  hours-per-week   48842 non-null  int64 
 13  native-country   48842 non-null  object
 14  income           48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


In [9]:
df_adult_income.head()

|   | age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | >50K |
| 4 | 18 | ? | 103497 | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |


## Initial Data Analysis

> Source of data

https://www.kaggle.com/datasets/wenruliu/adult-income-dataset

> Brief description of data

The data records personal information about adults, such as age and occupation.

> What is the target?

The target is the income class each adult falls into: more than 50K per annum (`>50K`), or 50K and less (`<=50K`).

> What does one row represent? (A person? A business? An event? A product?)

Each row represents an adult person.

> Is this a classification or regression problem?

This is a classification problem with 2 outcomes (more than 50K, 50K and less).
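For a two-class target like this, it can help to encode the label as 0/1 and look at the class shares before modeling. The sketch below uses a five-row sample copied from the `head()` output rather than the full DataFrame, and the column name `income_gt_50k` is a hypothetical choice:

```python
import pandas as pd

# Five rows copied from the head() output; the notebook itself would use
# df_adult_income here.
sample = pd.DataFrame({
    "age": [25, 38, 28, 44, 18],
    "income": ["<=50K", "<=50K", ">50K", ">50K", "<=50K"],
})

# Hypothetical binary target: 1 when income is '>50K', otherwise 0.
sample["income_gt_50k"] = (sample["income"].str.strip() == ">50K").astype(int)

# Fraction of rows in each class, indexed by 0 and 1.
class_share = sample["income_gt_50k"].value_counts(normalize=True).sort_index()
print(class_share)
```

The shares from this tiny sample say nothing about the full dataset; running the same lines on `df_adult_income` would show whether the classes are imbalanced.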

> How many features does the data have?

There are 14 features (15 columns, one of which is the target, `income`).

> How many rows are in the dataset?

There are 48842 rows.
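The row and feature counts can be read off the DataFrame directly instead of from the `info()` printout. A minimal sketch, assuming a feature is any column other than the target; the helper name `count_rows_and_features` is illustrative:

```python
import pandas as pd

def count_rows_and_features(df, target):
    """Return (rows, features), counting every column except `target` as a feature."""
    n_rows = len(df)
    # Index.drop raises KeyError if `target` is not a column, which guards
    # against a misspelled target name.
    n_features = len(df.columns.drop(target))
    return n_rows, n_features

# Tiny stand-in for df_adult_income.
sample = pd.DataFrame({
    "age": [25, 38, 28],
    "workclass": ["Private", "Private", "Local-gov"],
    "income": ["<=50K", "<=50K", ">50K"],
})

n_rows, n_features = count_rows_and_features(sample, target="income")
print(n_rows, n_features)
```

Applied to `df_adult_income` with `target="income"`, the `info()` output above implies this would report 48842 rows and 14 features.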

> What, if any, challenges do you foresee in cleaning, exploring, or modeling this dataset?

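Missing values are stored as the string `?` rather than as nulls, so `info()` reports every column as fully non-null even though values are missing (see `workclass` and `occupation` in row 4 above). Beyond that, cleaning should be routine, with some renaming of columns and recoding of values.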
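One way to make the `?` placeholders visible to pandas is to convert them to `NaN`. The sketch below is an illustration on a small sample built from the `head()` output, assuming the placeholder is exactly `?` with no surrounding whitespace:

```python
import numpy as np
import pandas as pd

# Three rows based on the head() output; the last one carries '?' placeholders.
sample = pd.DataFrame({
    "workclass": ["Private", "Local-gov", "?"],
    "occupation": ["Machine-op-inspct", "Protective-serv", "?"],
    "hours-per-week": [40, 40, 30],
})

# Replace cells that are exactly '?' with NaN so isna() can count them.
cleaned = sample.replace("?", np.nan)

missing_per_column = cleaned.isna().sum()
print(missing_per_column)
```

An alternative is to pass `na_values="?"` to `pd.read_csv` so the conversion happens at load time.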

# Car Insurance Claims

## Initial Data Analysis

In [20]:
# Filepath for the car insurance claims CSV
fpath_car = 'https://github.com/YoungVoid/Machine-Learning/raw/main/Car%20Insurance%20Dataset/Car_Insurance_Claim.csv'

# Read csv into DataFrame
df_car = pd.read_csv(fpath_car)

In [21]:
df_car.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   10000 non-null  int64  
 1   AGE                  10000 non-null  object 
 2   GENDER               10000 non-null  object 
 3   RACE                 10000 non-null  object 
 4   DRIVING_EXPERIENCE   10000 non-null  object 
 5   EDUCATION            10000 non-null  object 
 6   INCOME               10000 non-null  object 
 7   CREDIT_SCORE         9018 non-null   float64
 8   VEHICLE_OWNERSHIP    10000 non-null  float64
 9   VEHICLE_YEAR         10000 non-null  object 
 10  MARRIED              10000 non-null  float64
 11  CHILDREN             10000 non-null  float64
 12  POSTAL_CODE          10000 non-null  int64  
 13  ANNUAL_MILEAGE       9043 non-null   float64
 14  VEHICLE_TYPE         10000 non-null  object 
 15  SPEEDING_VIOLATIONS  10000 non-null  

In [22]:
df_car.head()

|   | ID | AGE | GENDER | RACE | DRIVING_EXPERIENCE | EDUCATION | INCOME | CREDIT_SCORE | VEHICLE_OWNERSHIP | VEHICLE_YEAR | MARRIED | CHILDREN | POSTAL_CODE | ANNUAL_MILEAGE | VEHICLE_TYPE | SPEEDING_VIOLATIONS | DUIS | PAST_ACCIDENTS | OUTCOME |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 569520 | 65+ | female | majority | 0-9y | high school | upper class | 0.629027 | 1.0 | after 2015 | 0.0 | 1.0 | 10238 | 12000.0 | sedan | 0 | 0 | 0 | 0.0 |
| 1 | 750365 | 16-25 | male | majority | 0-9y | none | poverty | 0.357757 | 0.0 | before 2015 | 0.0 | 0.0 | 10238 | 16000.0 | sedan | 0 | 0 | 0 | 1.0 |
| 2 | 199901 | 16-25 | female | majority | 0-9y | high school | working class | 0.493146 | 1.0 | before 2015 | 0.0 | 0.0 | 10238 | 11000.0 | sedan | 0 | 0 | 0 | 0.0 |
| 3 | 478866 | 16-25 | male | majority | 0-9y | university | working class | 0.206013 | 1.0 | before 2015 | 0.0 | 1.0 | 32765 | 11000.0 | sedan | 0 | 0 | 0 | 0.0 |
| 4 | 731664 | 26-39 | male | majority | 10-19y | none | working class | 0.388366 | 1.0 | before 2015 | 0.0 | 0.0 | 32765 | 12000.0 | sedan | 2 | 0 | 1 | 1.0 |


> Source of data

https://www.kaggle.com/datasets/sagnik1511/car-insurance-data

> Brief description of data

Personal and driving information about insurance customers, some of whom have made a claim and some of whom have not.

> What is the target?

Whether a customer has made a claim or not (the `OUTCOME` column).

> What does one row represent? (A person? A business? An event? A product?)

One row represents a customer.

> Is this a classification or regression problem?

This is a classification problem with 2 outcomes (claimed, not claimed).

> How many features does the data have?

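There are 18 features (19 columns minus the target). One of them, `ID`, is an identifier rather than a predictor.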

> How many rows are in the dataset?

There are 10000 rows.

> What, if any, challenges do you foresee in cleaning, exploring, or modeling this dataset?

`AGE` and `DRIVING_EXPERIENCE` are recorded as ranges rather than numbers, and `CREDIT_SCORE` and `ANNUAL_MILEAGE` have missing values (9018 and 9043 non-null out of 10000). Beyond that, cleaning should be routine. There is no data dictionary for this dataset, which could cause problems, but most column names are self-explanatory, so the impact should be low.
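One possible way to make the range columns usable by a model is to keep only the lower bound of each range as an integer. This is a sketch of that idea on the five rows shown in the `head()` output, not necessarily the approach the project will take; the helper `range_lower_bound` and the `_MIN` column names are hypothetical:

```python
import pandas as pd

# Range columns copied from the head() output above.
sample = pd.DataFrame({
    "AGE": ["65+", "16-25", "16-25", "16-25", "26-39"],
    "DRIVING_EXPERIENCE": ["0-9y", "0-9y", "0-9y", "0-9y", "10-19y"],
})

def range_lower_bound(col):
    """Extract the leading integer from labels such as '16-25', '65+' or '10-19y'."""
    # A label that does not start with a digit would become NaN and make
    # astype(int) raise, which flags unexpected values early.
    return col.str.extract(r"^(\d+)", expand=False).astype(int)

sample["AGE_MIN"] = range_lower_bound(sample["AGE"])
sample["DRIVING_EXPERIENCE_MIN"] = range_lower_bound(sample["DRIVING_EXPERIENCE"])
print(sample)
```

The lower bound preserves the ordering of the ranges, which is what an ordinal feature needs. An ordered categorical encoding would be an alternative if the exact set of range labels is known.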