<a href="https://colab.research.google.com/github/KryssyCo/DS-Unit-2-Applied-Modeling/blob/master/Krista_Shepard_DSPT2_U2S8M1_Assignment_Applied_Modeling_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Applied Modeling, Module 1

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Choose which observations you will use to train, validate, and test your model. And which observations, if any, to exclude.
- [ ] Determine whether your problem is regression or classification.
- [ ] Choose your evaluation metric.
- [ ] Begin with baselines: majority class baseline for classification, or mean baseline for regression, with your metric of choice.
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" information from the future?

## Reading

### ROC AUC
- [Machine Learning Meets Economics](http://blog.mldb.ai/blog/posts/2016/01/ml-meets-economics/)
- [ROC curves and Area Under the Curve explained](https://www.dataschool.io/roc-curves-and-auc-explained/)
- [The philosophical argument for using ROC curves](https://lukeoakdenrayner.wordpress.com/2018/01/07/the-philosophical-argument-for-using-roc-curves/)

### Imbalanced Classes
- [imbalanced-learn](https://github.com/scikit-learn-contrib/imbalanced-learn)
- [Learning from Imbalanced Classes](https://www.svds.com/tbt-learning-imbalanced-classes/)

### Last lesson
- [Attacking discrimination with smarter machine learning](https://research.google.com/bigpicture/attacking-discrimination-in-ml/), by Google Research, with interactive visualizations. _"A threshold classifier essentially makes a yes/no decision, putting things in one category or another. We look at how these classifiers work, ways they can potentially be unfair, and how you might turn an unfair classifier into a fairer one. As an illustrative example, we focus on loan granting scenarios where a bank may grant or deny a loan based on a single, automatically computed number such as a credit score."_
- [How Shopify Capital Uses Quantile Regression To Help Merchants Succeed](https://engineering.shopify.com/blogs/engineering/how-shopify-uses-machine-learning-to-help-our-merchants-grow-their-business)
- [Maximizing Scarce Maintenance Resources with Data: Applying predictive modeling, precision at k, and clustering to optimize impact](https://towardsdatascience.com/maximizing-scarce-maintenance-resources-with-data-8f3491133050), by Lambda DS3 student Michael Brady. His blog post extends the Tanzania Waterpumps scenario, far beyond what's in the lecture notebook.
- [Notebook about how to calculate expected value from a confusion matrix by treating it as a cost-benefit matrix](https://github.com/podopie/DAT18NYC/blob/master/classes/13-expected_value_cost_benefit_analysis.ipynb)
- [Simple guide to confusion matrix terminology](https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/) by Kevin Markham, with video
- [Visualizing Machine Learning Thresholds to Make Better Business Decisions](https://blog.insightdatascience.com/visualizing-machine-learning-thresholds-to-make-better-business-decisions-4ab07f823415)

## This is where I obtained my data.

https://www.apartmentlist.com/rentonomics/rental-price-data/

https://data.world/sya/marijuana-laws-by-state

I updated the marijuana-laws-by-state CSV with data from

https://www.governing.com/gov-data/safety-justice/state-marijuana-laws-map-medical-recreational.html

and confirmed the information against:

https://disa.com/map-of-marijuana-legality-by-state


In [1]:
# Read both CSV files, merge them on their shared column(s),
# and look at the first 5 observations (the last 5 are in the next cell).
import pandas as pd

rent_url = 'https://raw.githubusercontent.com/KryssyCo/Unit-2-Project/master/Apartment-List-Rent-Data-State_2019-8.csv'
laws_url = 'https://raw.githubusercontent.com/KryssyCo/Unit-2-Project/master/state_marijuana_laws_09_2019%20-%20state_marijuana_laws_10_2016.csv'
rent = pd.merge(pd.read_csv(rent_url), pd.read_csv(laws_url))

print(rent.shape)
rent.head()

(245, 74)


Unnamed: 0,Location,Location_Type,Bedroom_Size,Price_2014_01,Price_2014_02,Price_2014_03,Price_2014_04,Price_2014_05,Price_2014_06,Price_2014_07,Price_2014_08,Price_2014_09,Price_2014_10,Price_2014_11,Price_2014_12,Price_2015_01,Price_2015_02,Price_2015_03,Price_2015_04,Price_2015_05,Price_2015_06,Price_2015_07,Price_2015_08,Price_2015_09,Price_2015_10,Price_2015_11,Price_2015_12,Price_2016_01,Price_2016_02,Price_2016_03,Price_2016_04,Price_2016_05,Price_2016_06,Price_2016_07,Price_2016_08,Price_2016_09,Price_2016_10,Price_2016_11,Price_2016_12,Price_2017_01,Price_2017_02,Price_2017_03,Price_2017_04,Price_2017_05,Price_2017_06,Price_2017_07,Price_2017_08,Price_2017_09,Price_2017_10,Price_2017_11,Price_2017_12,Price_2018_01,Price_2018_02,Price_2018_03,Price_2018_04,Price_2018_05,Price_2018_06,Price_2018_07,Price_2018_08,Price_2018_09,Price_2018_10,Price_2018_11,Price_2018_12,Price_2019_01,Price_2019_02,Price_2019_03,Price_2019_04,Price_2019_05,Price_2019_06,Price_2019_07,Price_2019_08,medical_mj_legalized,recreational_mj_legalized,no_mj_legalization
0,Alabama,State,Studio,573,573,574,574,575,576,578.0,580.0,580.0,580.0,580.0,580.0,580.0,580.0,580.0,581.0,582.0,581.0,581.0,579,579,578.0,579.0,581.0,580.0,578,574,578,582,585,587,588,588,589,589,591,592,593,594,595,598,600,602,602,601,600,600,600,601,598,598,600,605,609,608,607,606,609,611,613,612,610,609,613,618,622,622,621,,,Yes
1,Alabama,State,1br,624,624,625,625,626,627,629.0,631.0,631.0,631.0,631.0,632.0,631.0,631.0,632.0,632.0,633.0,633.0,633.0,631,630,629.0,630.0,633.0,631.0,629,625,630,633,637,639,640,640,641,642,643,644,645,646,648,650,653,655,655,654,653,654,653,654,651,651,654,658,663,662,661,659,663,665,667,666,664,663,667,672,677,677,676,,,Yes
2,Alabama,State,2br,758,758,759,759,760,762,765.0,766.0,767.0,766.0,767.0,767.0,767.0,767.0,767.0,768.0,769.0,768.0,769.0,766,765,765.0,766.0,769.0,766.0,764,760,765,769,774,776,777,778,779,779,781,783,784,785,787,790,793,796,796,795,794,794,794,795,791,791,794,800,805,804,803,801,805,808,811,809,807,805,811,817,823,823,822,,,Yes
3,Alabama,State,3br,1013,1013,1014,1015,1016,1018,1022.0,1024.0,1025.0,1024.0,1024.0,1026.0,1025.0,1025.0,1025.0,1027.0,1028.0,1027.0,1027.0,1024,1023,1022.0,1023.0,1027.0,1024.0,1021,1015,1022,1028,1034,1037,1039,1039,1041,1042,1044,1046,1048,1049,1052,1056,1060,1063,1063,1062,1061,1061,1061,1062,1056,1057,1061,1069,1076,1074,1072,1070,1076,1080,1083,1081,1078,1076,1083,1091,1099,1100,1098,,,Yes
4,Alabama,State,4br,1178,1178,1179,1180,1182,1184,1188.0,1191.0,1192.0,1191.0,1192.0,1193.0,1192.0,1192.0,1193.0,1194.0,1196.0,1194.0,1195.0,1191,1189,1189.0,1190.0,1195.0,1191.0,1187,1181,1189,1196,1203,1206,1208,1209,1210,1211,1214,1216,1219,1220,1224,1228,1233,1237,1237,1236,1234,1234,1234,1235,1229,1229,1234,1243,1251,1250,1248,1245,1252,1256,1260,1257,1254,1251,1260,1269,1279,1279,1277,,,Yes
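
Called without `how` or `on`, `pd.merge` performs an inner join on the columns the two frames share, so a location that appears in only one file is dropped without warning. A minimal sketch of one way to check for that, using two small toy frames rather than the real CSVs. The Alabama and Wyoming studio values are taken from the notebook output; the `Only-in-...` rows and their values are placeholders invented for illustration.

```python
import pandas as pd

# Toy stand-ins for the two CSVs. Alabama and Wyoming (studio) values come
# from the notebook output; the "Only-in-..." rows are made-up placeholders.
rent_toy = pd.DataFrame({
    'Location': ['Alabama', 'Wyoming', 'Only-in-rent'],
    'Price_2019_08': [621, 548, 0],
})
laws_toy = pd.DataFrame({
    'Location': ['Alabama', 'Wyoming', 'Only-in-laws'],
    'no_mj_legalization': ['Yes', 'Yes', 'Yes'],
})

# Default merge: inner join on the shared column, so unmatched rows vanish.
inner = pd.merge(rent_toy, laws_toy)

# An outer join with indicator=True adds a `_merge` column that records
# whether each row came from the left frame, the right frame, or both.
check = pd.merge(rent_toy, laws_toy, on='Location', how='outer', indicator=True)
unmatched = check.loc[check['_merge'] != 'both', ['Location', '_merge']]
print(unmatched)
```

Running the same outer merge on the real files would list any locations that exist in one source but not the other.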


In [2]:
rent.tail()

Unnamed: 0,Location,Location_Type,Bedroom_Size,Price_2014_01,Price_2014_02,Price_2014_03,Price_2014_04,Price_2014_05,Price_2014_06,Price_2014_07,Price_2014_08,Price_2014_09,Price_2014_10,Price_2014_11,Price_2014_12,Price_2015_01,Price_2015_02,Price_2015_03,Price_2015_04,Price_2015_05,Price_2015_06,Price_2015_07,Price_2015_08,Price_2015_09,Price_2015_10,Price_2015_11,Price_2015_12,Price_2016_01,Price_2016_02,Price_2016_03,Price_2016_04,Price_2016_05,Price_2016_06,Price_2016_07,Price_2016_08,Price_2016_09,Price_2016_10,Price_2016_11,Price_2016_12,Price_2017_01,Price_2017_02,Price_2017_03,Price_2017_04,Price_2017_05,Price_2017_06,Price_2017_07,Price_2017_08,Price_2017_09,Price_2017_10,Price_2017_11,Price_2017_12,Price_2018_01,Price_2018_02,Price_2018_03,Price_2018_04,Price_2018_05,Price_2018_06,Price_2018_07,Price_2018_08,Price_2018_09,Price_2018_10,Price_2018_11,Price_2018_12,Price_2019_01,Price_2019_02,Price_2019_03,Price_2019_04,Price_2019_05,Price_2019_06,Price_2019_07,Price_2019_08,medical_mj_legalized,recreational_mj_legalized,no_mj_legalization
240,Wyoming,State,Studio,557,554,553,553,555,556,557.0,555.0,554.0,555.0,559.0,562.0,560.0,560.0,562.0,562.0,561.0,559.0,561.0,562,558,559.0,558.0,558.0,550.0,549,546,550,548,548,546,546,546,545,546,545,547,548,546,543,542,544,546,546,544,541,538,539,541,542,541,542,542,544,543,543,542,543,544,545,545,545,545,546,548,550,549,548,,,Yes
241,Wyoming,State,1br,624,621,620,620,622,623,624.0,621.0,620.0,622.0,626.0,630.0,628.0,627.0,630.0,630.0,629.0,627.0,628.0,629,626,627.0,626.0,625.0,617.0,615,612,616,614,614,612,612,612,611,611,611,613,614,612,609,608,610,612,612,610,607,603,604,607,608,607,607,608,609,608,608,608,608,609,610,611,611,611,612,614,616,615,614,,,Yes
242,Wyoming,State,2br,801,797,796,796,798,800,801.0,798.0,797.0,798.0,804.0,808.0,806.0,805.0,808.0,809.0,807.0,805.0,807.0,808,803,805.0,804.0,803.0,792.0,790,786,792,788,788,786,786,786,785,785,784,788,789,786,781,781,783,786,786,783,779,775,776,779,781,779,780,781,782,781,781,780,781,783,784,784,784,784,786,788,791,790,789,,,Yes
243,Wyoming,State,3br,1089,1084,1083,1082,1085,1087,1089.0,1085.0,1083.0,1085.0,1093.0,1099.0,1096.0,1095.0,1099.0,1100.0,1097.0,1094.0,1097.0,1099,1092,1094.0,1092.0,1092.0,1076.0,1073,1068,1076,1072,1071,1068,1068,1069,1067,1067,1066,1071,1072,1069,1062,1061,1065,1069,1069,1064,1059,1053,1055,1059,1061,1059,1060,1061,1064,1062,1061,1061,1062,1064,1066,1066,1066,1066,1069,1072,1075,1074,1072,,,Yes
244,Wyoming,State,4br,1288,1283,1281,1280,1284,1286,1289.0,1283.0,1281.0,1284.0,1293.0,1300.0,1297.0,1295.0,1300.0,1301.0,1298.0,1294.0,1297.0,1300,1292,1294.0,1293.0,1292.0,1273.0,1270,1264,1273,1268,1268,1264,1264,1264,1262,1263,1262,1267,1269,1264,1257,1256,1260,1265,1264,1259,1253,1246,1248,1253,1256,1253,1254,1256,1259,1257,1256,1255,1257,1259,1261,1261,1261,1261,1264,1268,1272,1270,1269,,,Yes


## Choose your target. Which column in your tabular dataset will you predict?

### The column I chose as my target is currently labeled `Price_2019_08`. I intend to convert the price column names to dates using pandas' `to_datetime`.
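
A minimal sketch of how that renaming might look, assuming every price column follows the `Price_YYYY_MM` pattern seen in the output above. The toy frame holds only the last three months of the Alabama studio and 1br rows, and the names `toy`, `renamed`, and `long` are hypothetical. The second half shows an alternative long layout, where the dates become values in a `month` column instead of column labels.

```python
import pandas as pd

# Toy frame: last three price columns of the Alabama studio and 1br rows.
toy = pd.DataFrame({
    'Location': ['Alabama', 'Alabama'],
    'Bedroom_Size': ['Studio', '1br'],
    'Price_2019_06': [622, 677],
    'Price_2019_07': [622, 677],
    'Price_2019_08': [621, 676],
})

# Option 1: rename the price columns to Timestamps.
price_cols = [c for c in toy.columns if c.startswith('Price_')]
dates = pd.to_datetime([c.replace('Price_', '') for c in price_cols],
                       format='%Y_%m')
renamed = toy.rename(columns=dict(zip(price_cols, dates)))

# Option 2: reshape to one row per (location, bedroom size, month).
long = toy.melt(id_vars=['Location', 'Bedroom_Size'],
                var_name='month', value_name='price')
long['month'] = pd.to_datetime(
    long['month'].str.replace('Price_', '', regex=False), format='%Y_%m')

print(renamed.columns.tolist())
print(long)
```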

### I've done some initial data exploration on the column in the cells below.

In [3]:
# Across all state/bedroom-size rows, the mean rent in August 2019 was about
# $1,135. The least expensive was $489 and the most expensive was $2,967.
rent['Price_2019_08'].describe()

count     245.000000
mean     1135.395918
std       465.079284
min       489.000000
25%       775.000000
50%      1069.000000
75%      1355.000000
max      2967.000000
Name: Price_2019_08, dtype: float64
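
If the target is treated as a continuous price (a regression framing), the mean reported by `describe()` can double as a mean baseline. The sketch below assumes mean absolute error as the metric, which is only one possible choice, and uses just the ten `Price_2019_08` values visible in the `head()` and `tail()` output rather than the full column. In a real workflow the baseline would be computed on the training split and scored on the validation split.

```python
import pandas as pd

# Toy target: the ten Price_2019_08 values shown for Alabama and Wyoming,
# not the full 245-row column.
y = pd.Series([621, 676, 822, 1098, 1277, 548, 614, 789, 1072, 1269],
              name='Price_2019_08')

baseline = y.mean()                # predict the same value for every row
mae = (y - baseline).abs().mean()  # mean absolute error of that constant guess

print(f'Mean baseline: ${baseline:,.2f}')
print(f'Baseline MAE:  ${mae:,.2f}')
```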

In [6]:
# Check the target column for NaNs; there are none.
rent['Price_2019_08'].isnull().sum()

0
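
The check above covers only the target column. The preview rows show blanks in `medical_mj_legalized` and `recreational_mj_legalized`, so a per-column count may be worth running as well. A small sketch on a toy frame that mirrors the last four columns of the Alabama and Wyoming studio rows. Whether a blank in those columns means "No" is an assumption to confirm against the source CSV.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the last four columns of two rows from the output above.
toy = pd.DataFrame({
    'Price_2019_08': [621, 548],
    'medical_mj_legalized': [np.nan, np.nan],
    'recreational_mj_legalized': [np.nan, np.nan],
    'no_mj_legalization': ['Yes', 'Yes'],
})

# Count missing values in every column, then show only columns that have any.
null_counts = toy.isnull().sum()
print(null_counts[null_counts > 0])
```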

## Choose which observations you will use to train, validate, and test your model. And which observations, if any, to exclude.

### My observations are U.S. states. The merged data contains 49 of them: Hawaii and the District of Columbia do not appear in the counts below. I may begin with all 49 and whittle them down to a more manageable number. At this time each state has 5 entries, one per bedroom size (Studio, 1br, 2br, 3br, 4br).
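
Because each state contributes 5 rows, one option is to split by state rather than by row, so that no state appears in more than one of train, validation, and test. A hedged sketch on a ten-state toy frame follows. The 60/20/20 proportions and the seed are arbitrary choices, and a time-based split on the monthly price columns would be another reasonable approach.

```python
import random
import pandas as pd

# Toy frame: ten states, five bedroom sizes each, like the real data's layout.
sizes = ['Studio', '1br', '2br', '3br', '4br']
states = ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
          'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia']
toy = pd.DataFrame([(s, b) for s in states for b in sizes],
                   columns=['Location', 'Bedroom_Size'])

# Shuffle the states (not the rows), then assign whole states to each split.
shuffled = sorted(toy['Location'].unique())
random.Random(42).shuffle(shuffled)
test_states, val_states, train_states = shuffled[:2], shuffled[2:4], shuffled[4:]

train = toy[toy['Location'].isin(train_states)]
val = toy[toy['Location'].isin(val_states)]
test = toy[toy['Location'].isin(test_states)]

print(len(train), len(val), len(test))
```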

In [9]:
# `exclude` is ignored when describing a single Series, so it is omitted here.
rent['Location'].describe()

count                245
unique                49
top       North Carolina
freq                   5
Name: Location, dtype: object

In [10]:
rent['Location'].value_counts()

North Carolina    5
Wisconsin         5
New Mexico        5
New Hampshire     5
Rhode Island      5
Arizona           5
Illinois          5
North Dakota      5
Montana           5
Idaho             5
Colorado          5
Minnesota         5
Indiana           5
South Carolina    5
Georgia           5
Mississippi       5
Louisiana         5
West Virginia     5
Nebraska          5
Tennessee         5
South Dakota      5
Arkansas          5
New York          5
Alaska            5
Connecticut       5
Virginia          5
Texas             5
Iowa              5
Maryland          5
Maine             5
Nevada            5
Kansas            5
Oklahoma          5
Vermont           5
Alabama           5
Utah              5
Washington        5
Florida           5
New Jersey        5
Michigan          5
Oregon            5
Missouri          5
Massachusetts     5
California        5
Kentucky          5
Wyoming           5
Delaware          5
Pennsylvania      5
Ohio              5
Name: Location, dtyp