For this assignment you will be working with the [Adult](https://archive.ics.uci.edu/dataset/2/adult) dataset from the UC Irvine Machine Learning Repository.

This will be a fairly open-ended assignment in which you apply the concepts you learned in the past few weeks to build a model that predicts whether an adult makes more than $50,000 in annual income, or $50,000 or less.

There are 17 open-ended questions; make sure to answer all of them!

The following code will load the dataset into this notebook for you. Make sure to read through the description of the dataset variables below:

**Target Variable**:

- Income: `>50K, <=50K`

**Features**:
- Age: `continuous`
- Workclass: `Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked`
- fnlwgt (Final Weight): `continuous`
- Education: `Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool`
- Education-num (Education Number): `continuous`
- Marital-status: `Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse`
- Occupation: `Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces`
- Relationship: `Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried`
- Race: `White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black`
- Sex: `Female, Male`
- Capital-gain: `continuous`
- Capital-loss: `continuous`
- Hours-per-week: `continuous`
- Native-country: `United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands`

In [None]:
! pip install ucimlrepo



In [None]:
from ucimlrepo import fetch_ucirepo
import pandas as pd

In [None]:
adult = fetch_ucirepo(id=2)
X = adult.data.features
y = adult.data.targets

In [None]:
adult.variables

| | name | role | type | demographic | description | units | missing_values |
|---|---|---|---|---|---|---|---|
| 0 | age | Feature | Integer | Age | | | no |
| 1 | workclass | Feature | Categorical | Income | Private, Self-emp-not-inc, Self-emp-inc, Feder... | | yes |
| 2 | fnlwgt | Feature | Integer | | | | no |
| 3 | education | Feature | Categorical | Education Level | Bachelors, Some-college, 11th, HS-grad, Prof-... | | no |
| 4 | education-num | Feature | Integer | Education Level | | | no |
| 5 | marital-status | Feature | Categorical | Other | Married-civ-spouse, Divorced, Never-married, S... | | no |
| 6 | occupation | Feature | Categorical | Other | Tech-support, Craft-repair, Other-service, Sal... | | yes |
| 7 | relationship | Feature | Categorical | Other | Wife, Own-child, Husband, Not-in-family, Other... | | no |
| 8 | race | Feature | Categorical | Race | White, Asian-Pac-Islander, Amer-Indian-Eskimo,... | | no |
| 9 | sex | Feature | Binary | Sex | Female, Male. | | no |


In [None]:
df = pd.concat([X, y], axis=1)

In [None]:
df.head()

The above code will load the dataset into the variable `df`. Now it's up to you to predict an adult's income. Here are a few pointers and TODOs for approaching this problem. As you go through this assignment, work through the following steps and answer the questions **in this same markdown cell**:

1. **Does the data need to be cleaned?**
   - Do some initial analysis, look at the data, and see if we need to use some pandas functions to reformat data cells (hint: you will have to do this, as this is somewhat of a messy dataset).
   - Do we really need all the columns? Do a bit of research and ask questions in the Discord to see whether some of the columns in this dataset are actually needed!
     - ***Question 1: When do we actually drop a column?***

2. **Deal with missing values**
   - A few columns have missing values; should you drop them? Impute these missing values as you see fit, such that the end model has good performance (note that this is a trial and error process as explained in class).
   - There are several methods for this; look at the 10/30 class notebook.
     - ***Question 2: Which methods of missing value imputation worked the best for you in the end? Do you think there is a specific reason why?***

3. **Encode the categorical data**
   - There are several categorical features in this data. We discussed several ways of encoding these features and when to use some over the others, so use these methods to encode the data!
     - ***Question 3: What is the difference between label encoding, one-hot encoding, and target encoding? For each one, list the pros and cons.***
     - ***Question 4: Do some research on your own; what are some other ways of encoding categorical data?***

4. **Split your data into a train and test dataset**
   - When splitting your data into a train and test dataset, there are several parameters you have to pass in/consider (many of which can be optimized with a trial and error process):
     - ***Question 5: What proportion of your dataset is the training dataset? Why did you choose this ratio?***
     - ***Question 6: Explain the difference between the train and test dataset. Why do we need them? Why can't we just train the model on the entire dataset?***
     - ***Question 7: Note that the dataset is very imbalanced. Why does this matter for splitting your dataset?***

5. **Scale your dataset**
   - Note that when you scale your dataset, you fit & transform the training data and only transform the test data.
     - ***Question 8: Why do you only fit to the training data and not the test data?***
     - ***Question 9: In class, I talked about the standard scaler, but there are other methods (MinMax, Normalization, etc.). Do some research into a few other methods and explain each one. Make sure to mention when to use one over the other.***
     - ***Question 10: What is the purpose of scaling your data in the first place?***

6. **Modeling + Evaluation**
   - For the assignment, build the following models to predict the binary "income" target, and output the accuracy, precision, recall, and F1 score for each one. Just use the default parameters:
     - Linear Regression
       - Note: Linear Regression does not make sense for this type of problem, so make it so that any predicted values above 0.5 are treated as a 1 and everything else as a 0.
       - ***Question 11: Why does linear regression not make sense to do here?***
     - Logistic Regression
       - ***Question 12: Why does it make more sense to use logistic regression than linear?***
     - Decision Trees
     - Random Forest
     - AdaBoost
       - ***Question 13: We did not talk about AdaBoost in class, but do some research on this type of model and first explain how it works (briefly).***
       - ***Question 14: What is the idea of a stump?***
       - ***Question 15: What is boosting and how does it apply to tree-based models?***

7. **For the best model from step 6, use GridSearchCV and RandomizedSearchCV to tune the model parameters; do some research into the hyperparameters:**
   - ***Question 16: Which parameters did you choose to tune?***
   - ***Question 17: Did GridSearchCV and RandomizedSearchCV output the same values for the hyperparameters? Why or why not?***

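As a starting point for steps 1 to 3, here is a minimal sketch on a small hand-made DataFrame, not the real data. It assumes that some income labels carry a trailing period and that some missing values are written as `"?"`; check whether that holds for `df` with `df["income"].unique()` and `df.isna().sum()`. The `toy` frame, the mode imputation, and the one-hot encoding are illustrative choices only, not the expected answer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics the kind of mess to look for (not the real data).
toy = pd.DataFrame({
    "age": [39, 50, 38, 53, 28],
    "workclass": ["State-gov", "?", "Private", np.nan, "Private"],
    "income": ["<=50K", ">50K.", "<=50K.", ">50K", "<=50K"],
})

# Step 1: normalize labels by removing stray whitespace and a trailing period.
toy["income"] = toy["income"].str.strip().str.rstrip(".")

# Step 2: treat "?" as missing, then impute with the most frequent category.
toy["workclass"] = toy["workclass"].replace("?", np.nan)
most_common = toy["workclass"].mode()[0]
toy["workclass"] = toy["workclass"].fillna(most_common)

# Step 3: one-hot encode the categorical feature and map the target to 0/1.
features = pd.get_dummies(toy[["age", "workclass"]], columns=["workclass"])
target = (toy["income"] == ">50K").astype(int)

print(features)
print(target.tolist())
```

On the real data the same calls would apply to `df`, but every categorical column needs checking, not just `workclass`, and mode imputation is only one of the options to compare for Question 2.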

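For steps 4 to 6, here is a minimal sketch on synthetic data (again, not the Adult dataset) showing a stratified split, a scaler fitted on the training data only, and linear regression outputs thresholded at 0.5. The names `X_demo` and `y_demo`, the 80/20 split, and `random_state=42` are placeholder assumptions rather than recommendations.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded features and an imbalanced 0/1 target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0.7).astype(int)

# Step 4: stratify so train and test keep roughly the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Step 5: learn the scaling statistics from the training data only,
# then reuse them unchanged on the test data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 6: linear regression returns real numbers, so threshold them at 0.5.
model = LinearRegression().fit(X_train_scaled, y_train)
y_pred = (model.predict(X_test_scaled) > 0.5).astype(int)

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, zero_division=0),
    "recall": recall_score(y_test, y_pred, zero_division=0),
    "f1": f1_score(y_test, y_pred, zero_division=0),
}
print(metrics)
```

The classifiers in step 6 (logistic regression, decision trees, random forest, AdaBoost) return class labels directly from `predict`, so the thresholding line is only needed for linear regression.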
In [None]:
# Do steps 1 - 7 here!