In [None]:
import numpy as np
import pandas as pd
from datascience import *
import scipy

# These lines do some fancy plotting magic. 
# You can use either seaborn, matplotlib, or datascience for visualizations.
import seaborn as sns

import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)
from matplotlib import patches

# To help you export your results
import otter
grader = otter.Notebook("exploration3.ipynb")

## Exploration 3: Prediction & Classification

Welcome to the final assignment of Data 8X! In this project, you will practice building a predictive model using ordinary least squares regression to predict a quantitative variable or k-nearest neighbors classification to predict a categorical variable. You may choose either option, depending on your dataset and research question/goal; neither option is better or worse than the other, so please choose the one that you find most interesting or applicable.

This assignment is due at the end of the semester on **Friday, May 14th at 11:59 PM Pacific Time.** Please submit your .zip file, which will contain your .ipynb file and .pdf, to Gradescope before that time. If you have any questions, you may email Ian at castro.ian@berkeley.edu or post on our class Piazza page. Office hours are available by appointment.

**IMPORTANT NOTE:** Because of the grading deadline, we will **NOT** accept any late submissions or provide extensions for this assignment, so please start early and ask questions as you go.

This notebook will help guide your analysis. Complete all steps and present your work in a clear, readable manner. Imagine you are presenting this notebook to a colleague: make your code clear and add comments where needed.

<!-- BEGIN QUESTION -->
<!--
BEGIN QUESTION
name: exp3
manual: True
points: 100
-->

### Part 1.1: Importing Data and Choosing Tools

To begin our analysis, please import your dataset. If you're coding in the Berkeley DataHub, you can upload your dataset by clicking on File > Open... and then Upload. Otherwise, if you are on your local system, just put the notebook in the same folder as your dataset. For this project, you are allowed to use any library you feel comfortable using, such as `pandas` or `datascience`.

If you are working in the `datascience` library, please use a .csv file and import it using `Table.read_table("my_file.csv")`. If you are working in pandas, you can import your data using `pd.read_csv("my_file.csv")` or, if your dataset is a different file type, one of the other functions such as `pd.read_json` described [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html).
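
As a minimal sketch of the pandas route, the snippet below reads a small CSV and checks that it loaded as expected. The CSV text and column names are made up for illustration; with a real file you would pass its path (for example `"my_file.csv"`) to `pd.read_csv` instead of the in-memory buffer used here.

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for a real file such as "my_file.csv".
csv_text = "height,weight,species\n1.2,3.4,a\n2.3,4.5,b\n3.1,5.9,a\n"

# pd.read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Quick checks that the import worked: size, column types, first rows.
print(df.shape)
print(df.dtypes)
print(df.head())
```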

In [None]:
# Import your dataset here and check that it imported correctly.
...

Now that you have your data imported, think of your goal. What information does your dataset provide? What type of variable do you want to predict? Are you trying to make a predictive model given unknown data, or are you just trying to learn about patterns such as the relationship between two variables (i.e. analyzing slopes)? The following table summarizes some information about the tools you can use in this project.



| | Bivariate Linear Regression | Multivariate Linear Regression | K-Nearest Neighbors Classification |
| --- | --- | --- | --- | 
| Goal: | Predict a quantitative output (y), given an input x | Predict a quantitative output (y), given any number of x-variables | Predict a (binary or multi-class) categorical variable |
| Approach: | Algebraic (ex. slope = r * SDy/SDx) or machine learning (sklearn) | Linear algebra ([lecture](https://ds100.org/sp21/lecture/lec13/)) or machine learning (sklearn) | Coding from scratch or machine learning (sklearn) |
| Resources: | [Data 8](https://inferentialthinking.com/chapters/15/Prediction.html) and sklearn [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) | sklearn [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) or lecture above | [Data 8](https://inferentialthinking.com/chapters/17/Classification.html) and sklearn [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) |
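
To make the algebraic approach in the table concrete, here is a small sketch that computes a bivariate regression line from `slope = r * SDy/SDx`. The data values and the `standard_units` helper name are assumptions chosen for illustration, not part of the assignment.

```python
import numpy as np

# Made-up data for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def standard_units(values):
    """Convert an array to standard units (mean 0, SD 1)."""
    return (values - np.mean(values)) / np.std(values)

# Correlation coefficient: the mean of the products in standard units.
r = np.mean(standard_units(x) * standard_units(y))

# Regression line: slope = r * SD(y) / SD(x); intercept = mean(y) - slope * mean(x).
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)

print(slope, intercept)
```

The result should agree with what a least squares fit from `sklearn` gives on the same single input variable.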


**FRQ 1:** Provide some initial background on your dataset and goals. What does your dataset show and where does it come from? What research question are you trying to answer? What do you want to predict? What method/approach, from the table above, do you plan to use? 

(Note: your questions/approach might change as you perform EDA in the next step, so it's okay if you change your mind.)

*Type your answer in this cell, replacing this text.*

### Part 1.2: Exploratory Data Analysis (EDA) & Data Cleaning

Now that we have our data, we're ready to code! Perform some preliminary exploratory data analysis and data cleaning on your dataset. This is where visualizations will come in handy; feel free to use `datascience`, `matplotlib`, and `seaborn` to make your graphs. **Remember to include any necessary comments, as well as titles, axis labels, legends, colors, etc. to make your graphs easy to understand.** You should include at least **3** useful visualizations related to your data.

Here is some useful documentation to reference as you do this work:
[datascience](http://data8.org/sp21/python-reference.html),
[pandas](https://pandas.pydata.org/docs/),
[matplotlib](https://matplotlib.org/stable/contents.html),
[seaborn](https://seaborn.pydata.org/),
 Google, and StackOverflow.

To get you started, here are some things to think about:
1. For linear regression: what independent x variables are linearly related to your output variable y?
1. For classification: what variables help us differentiate our classes the best?
1. If necessary, how could you transform your data (ex. x --> 1/x, or using the average of a set of variables) to make it more linear, better at differentiating groups, or, more generally, better for making predictions?


As for cleaning your dataset for use, here are some questions to think about:
1. Are there any null values? If so, how will you remove/replace them? 
1. Are there any outliers? Will you need to remove them?
1. Do you need to convert any data types (ex. strings to floats)?

As part of this, calculate any necessary statistics (r, means, standard deviations, etc.) to better understand your dataset and import/create any helpful functions.
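
The cleaning questions above could be approached in pandas roughly as follows. This is only a sketch on made-up data: the column names are hypothetical, and the 1.5 * IQR outlier rule is one common convention rather than a requirement of this assignment.

```python
import pandas as pd

# Made-up data with the kinds of problems listed above.
raw = pd.DataFrame({
    "x": ["1.0", "2.0", "3.0", "4.0", None, "5.0"],  # numbers stored as strings, one null
    "y": [2.0, 4.1, 5.9, 8.2, 10.0, 100.0],           # last value looks like an outlier
})

# 1. Count null values in each column, then drop rows that contain any.
print(raw.isnull().sum())
clean = raw.dropna().copy()

# 2. Convert strings to floats.
clean["x"] = clean["x"].astype(float)

# 3. One possible outlier rule (an assumption, not a requirement):
#    keep rows whose y is within 1.5 * IQR of the quartiles.
q1, q3 = clean["y"].quantile([0.25, 0.75])
iqr = q3 - q1
clean = clean[clean["y"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# 4. Summary statistics and the correlation coefficient r.
print(clean.describe())
print(clean["x"].corr(clean["y"]))
```

Whether to drop, replace, or keep nulls and outliers depends on your dataset, so explain whichever choice you make.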



In [None]:
# Perform your EDA and data cleaning here. Add as many cells as you need.
...

In [None]:
# If you are using the datascience library, please run this cell
# to convert your data into a dataframe for use in sklearn. 
# Replace "cleaned_table" with the name of your cleaned dataset (as a table)
# and rename "clean_data_df" to a useful name.
clean_data_df = cleaned_table.to_df()
clean_data_df.head()

In [None]:
# If you want to export your cleaned data as a csv, run this cell.
# You can find the .csv file in the directory (click File > Open...)
# Make sure you're using a dataframe datatype, and remember to replace clean_data_df
# with your variable name.
clean_data_df.to_csv("your_cleaned_data.csv")

**FRQ 2:** Generally describe your exploratory data analysis and data cleaning process here. Did you notice anything interesting? Did you run into any trouble? How did you address issues or inconsistencies in the data? Did anything from your analysis cause you to change your answer for FRQ1? Lastly, moving forward, what variables do you plan to use for your predictive model based on this analysis?

*Type your answer in this cell, replacing this text.*

### Part 1.3: Train-Test Split

Now that you've looked at and cleaned your dataset, let's split it into a training set and a test set so we can properly evaluate your model once it's built.

To do this, we will use scikit-learn's `train_test_split` function. Make sure your data is a **pandas dataframe** before doing this. 

`train_test_split` takes your input variables (the x's you want to use to predict) as a dataframe, followed by your output variable (y) as an array or series. The proportion of data to use for training is passed by keyword as `train_size` (or, equivalently, the proportion for testing as `test_size`). The documentation is [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

Some recommended train-test splits are 80/20 (0.8 for training, 0.2 for testing) or 90/10 (0.9 train, 0.1 test). In general, when choosing, just make sure your training set is larger than the test set, because we want to use more data to train our model.
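
For reference, an 80/20 split might look like the sketch below. The dataframe and its column names (`x1`, `x2`, `y`) are made up for illustration; substitute your own cleaned dataframe and columns.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up dataset: two input columns and one output column.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "x1": rng.normal(size=100),
    "x2": rng.normal(size=100),
})
data["y"] = 3 * data["x1"] - 2 * data["x2"] + rng.normal(scale=0.5, size=100)

X = data[["x1", "x2"]]  # inputs as a dataframe
y = data["y"]           # output as a series

# 80/20 split; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42
)

print(X_train.shape, X_test.shape)
```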

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Fill in the ellipses. Don't change the random_state, so your results are repeatable.
X_train, X_test, y_train, y_test = train_test_split(..., ..., train_size = ..., random_state = 42)

# Check if your split worked:
X_train

### Part 2.1: Choosing Your Model

Now that we have our data set up and ready to go, you can begin creating your model. The next cell imports the `sklearn` machine learning tools. To create a model object, call the model's constructor and assign the result to a name.

For example:

`my_linear_model = LinearRegression(fit_intercept = True)`

or

`my_knn_classifier = KNeighborsClassifier(n_neighbors = ??)`

The included arguments add an intercept to your linear regression line and let you decide the number of neighbors to check, respectively. More information on initializing your models can be found here: [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) and [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html). 

However, if you are using an algebraic, linear algebra, or coding from scratch approach (i.e. the equations and code we've looked at and discussed in class), you can ignore the `sklearn` cells below. Just make a comment that you're using that approach.

In [None]:
# Imports the sklearn linear regression and K-NN tools.
# Run this cell.

from sklearn.linear_model import LinearRegression

from sklearn.neighbors import KNeighborsClassifier

In [None]:
# Initialize your model/create a model object here.
# See the prompt above on how to do this.

...

### Part 2.2: Fitting Your Model

Now that you've initialized your model, fit it to your **training set**. You can do this by using the following code:

`your_model.fit(X_train, y_train)`

If you're using a manual approach (equations, coding K-NN from scratch), show all of your work in building your model in the cells below. Add any cells you need.
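
As a rough illustration of the `sklearn` route, the sketch below initializes and fits both model types on made-up training data. The variable names and the choice of 5 neighbors are arbitrary assumptions for this example; you only need the option that matches your project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Made-up training data: 80 rows, 2 features.
X_train = rng.normal(size=(80, 2))

# Regression target: a noisy linear combination of the features.
y_train_quantitative = (
    3 * X_train[:, 0] - 2 * X_train[:, 1] + rng.normal(scale=0.5, size=80)
)

# Classification target: class 1 if the first feature is positive, else 0.
y_train_categorical = (X_train[:, 0] > 0).astype(int)

# Option A: linear regression.
my_linear_model = LinearRegression(fit_intercept=True)
my_linear_model.fit(X_train, y_train_quantitative)

# Option B: k-nearest neighbors; k = 5 is an arbitrary choice for this sketch.
my_knn_classifier = KNeighborsClassifier(n_neighbors=5)
my_knn_classifier.fit(X_train, y_train_categorical)
```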

In [None]:
# Fit or build your model here. 
...

**FRQ 3:** Report the details of your model here. If you're using a linear regression model, type the slope(s) and intercept below in the form: 

`output_var = slope1 * x_var1 + slope2 * x_var2 + slope3 * x_var3 + ... + intercept`, replacing it with the names of the variables you used and the slopes calculated above. 

You can check the slopes by running `your_model.coef_`, where the slopes are in the same order as the columns in `X_train`, and the intercept by running `your_model.intercept_` in a coding cell. 
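
One way to turn `coef_` and `intercept_` into the equation form above is sketched here. The data is made up and noise-free, so the fitted slopes and intercept should come out close to the values used to build it; the names `x_var1`, `x_var2`, and `output_var` are placeholders.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up, noise-free data where output_var = 2 * x_var1 - 1 * x_var2 + 5 exactly.
X_train = pd.DataFrame({
    "x_var1": [0.0, 1.0, 2.0, 3.0, 4.0],
    "x_var2": [1.0, 0.0, 3.0, 2.0, 5.0],
})
y_train = 2 * X_train["x_var1"] - 1 * X_train["x_var2"] + 5

your_model = LinearRegression(fit_intercept=True).fit(X_train, y_train)

# Pair each slope with its column name; coef_ follows the column order of X_train.
terms = [
    f"{slope:.2f} * {name}"
    for name, slope in zip(X_train.columns, your_model.coef_)
]
equation = "output_var = " + " + ".join(terms) + f" + {your_model.intercept_:.2f}"
print(equation)
```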

If you're using a k-nearest neighbors classifier, just mention that you're doing that below, because we can't describe the model in words (only visually). If you're interested in plotting the decision boundaries, check out this [code](https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html). However, you don't need to do this; the level of plotting is outside of the scope of this class. 

*Type your answer in this cell, replacing this text.*

### Part 3.1: Predicting with your Model

Now that we've built your model, we can use it to predict values! This is where your **testing set** comes in.

In this section, use `X_test` and your model to make a prediction for every row in the testing set. Assign your predictions to the name `predictions`.

If you're using `sklearn` and fitted your model above, all you need to do is run:

`your_model.predict(X_test)`.

If you're using equations or a manual method, plug in the pieces of `X_test` to the corresponding parameters of your model.
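
The two routes should give the same predictions for a linear model. The sketch below, on made-up single-feature data, compares `predict` with plugging `X_test` into the fitted slope and intercept by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training and testing inputs with a single feature.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([3.0, 5.0, 7.0, 9.0])  # exactly y = 2x + 1
X_test = np.array([[5.0], [6.0]])

your_model = LinearRegression().fit(X_train, y_train)

# sklearn route: one prediction per row of X_test.
predictions = your_model.predict(X_test)

# Manual route: plug X_test into slope * x + intercept.
manual_predictions = your_model.coef_[0] * X_test[:, 0] + your_model.intercept_

print(predictions)
print(manual_predictions)
```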

In [None]:
# Create your predictions below.
predictions = ...

### Part 3.2: Evaluating your Model

We can now evaluate the accuracy of our model with this set of predictions. Calculate the following metrics to "score" your model, using `predictions` and `y_test`.

| Linear Regression | K-NN Classification |
| --- | --- |
| r squared, mean squared error | accuracy (% correct classifications) |

Here are some helpful metric functions from `sklearn` to help you evaluate your model. Feel free to use these, or any other function/code you write, to do these calculations. Information about these functions (and how to use them) can be found in the documentation [here](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics).
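
As a quick sketch of how these metric functions are called, the example below scores made-up predictions against made-up true values. Each function takes the true values first and the predictions second.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, accuracy_score

# Regression: made-up true values and predictions.
y_test = np.array([3.0, 5.0, 7.0, 9.0])
predictions = np.array([2.5, 5.5, 7.0, 8.0])

mse = mean_squared_error(y_test, predictions)  # average squared error
r2 = r2_score(y_test, predictions)             # 1 - (residual SS / total SS)

# Classification: made-up true labels and predicted labels.
y_test_labels = np.array(["a", "b", "a", "a", "b"])
predicted_labels = np.array(["a", "b", "b", "a", "b"])
accuracy = accuracy_score(y_test_labels, predicted_labels)  # fraction correct

print(mse, r2, accuracy)
```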

In [None]:
# Run this cell to import some useful functions.
from sklearn.metrics import r2_score, mean_squared_error, accuracy_score 

In [None]:
# Evaluate your model here.
...

**FRQ 4:** What is the accuracy of your model? Report the metrics you calculated and interpret them in context. 

*Type your answer in this cell, replacing this text.*

### Part 4 (Optional): Adjusting & Finalizing Your Model

Great! Congratulations on building, predicting, and evaluating your model. But that's not all -- we can probably do better. Our goal now is to either reduce the mean squared error (if doing regression) or increase the accuracy (if doing K-NN classification). 

There are multiple ways we can do this. In linear regression, we can clean the data more (removing points that would skew the line) or change the features/x-variables we use in our model. For classification, we could change the features/x-variables we use or change k (the number of neighbors we check). These are only a few examples of what we can do.

In this section, try to adjust your model. Once you've updated your dataset (ex. by including new/different variables), you'll need to create a new training and test set; use `random_state = 42` so the split is repeatable. As long as the number and order of rows are unchanged, this keeps the same individuals in each set. Use that updated training set to fit your model, and then predict and evaluate that new model using the updated test set.

Did you improve on your accuracy or reduce your mean squared error?
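
For the classification case, one way to explore different values of k is sketched below on made-up two-class data. The candidate values of k and all variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Made-up two-class data: the class depends on the sum of two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42
)

# Fit one classifier per candidate k and record its test accuracy.
accuracies = {}
for k in [1, 3, 5, 7, 9]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    accuracies[k] = accuracy_score(y_test, model.predict(X_test))

# The k with the highest recorded accuracy (ties go to the smallest k tried first).
best_k = max(accuracies, key=accuracies.get)
print(accuracies, best_k)
```

Keep in mind that comparing many settings on the same test set can make the best score look somewhat optimistic, so report it with that caveat.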

In [None]:
# Practice improving your model.
...

<!-- END QUESTION -->

Congratulations on finishing Data 8X! I hope the course was interesting and useful. If you would like to learn more and improve your data science skills, check out bCourses and the slides from our final live discussion on 4/30 for more resources. Feel free to reach out to Ian any time if you have any questions or are looking for additional learning opportunities in the field.

In [None]:
# Save your notebook first, run all cells, and then run this cell to export your submission.
# Please upload the .zip, which contains your .pdf and .ipynb, to Gradescope to submit! 
grader.export()