# Week 7



## The intro

Anyway. I'm sure you guys have a lot to do this week, so we'll try to keep it relatively light (although there should be enough optional exercises to keep you all busy).

[![IMAGE ALT TEXT HERE](https://img.youtube.com/vi/6b4EQk96SfQ/0.jpg)](https://www.youtube.com/watch?v=6b4EQk96SfQ)

Remember that last week you worked on classifying data using *KNN*. We are going to continue working with machine learning, this time looking at *decision trees* and seeing how new information can influence how well our model predicts which type of crime happened.

Specifically, crimes can have many causes, so we can combine data sources to better understand what makes a criminal commit a crime. Are there specific factors that trigger an individual to act? Since criminals are notoriously shy about sharing information, we must try to find this out in a different way. Luckily for us, we can do this with data!

*We are going to use weather data* from San Francisco to try to relate different crimes to meteorological conditions!

* We'll start with a relatively simple exercise focusing on adding weather data to the decision tree from last week (Part 1, 2, and 3).
* Then we'll prepare a bit for next week, when we get into the topic of explanatory data visualization, with some lectures and reading (Part 4).

## Part 1: Decision Tree Intro

Now we turn to decision trees. This is a fantastically useful supervised machine-learning method that we use all the time in research. To get started on decision trees, we'll use a fantastic *visual* introduction.


*Decision Trees Reading 1*: The visual introduction to decision trees on this webpage is AMAZING. Take a look to get an intuitive feel for how trees work. Do not miss this one, it's a treat! http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

*Decision Trees Reading 2*: The second part of the visual introduction is about model selection and the bias/variance tradeoffs that we looked into earlier during this lesson. Once again, those topics are visualized in a fantastic and inspiring way that will make them stick in your brain better. So check it out: http://www.r2d3.us/visual-intro-to-machine-learning-part-2/

*Decision Trees Reading 3*: Finally, you can also read about decision trees in DSFS, chapter 17. **You can get it on DTU Learn**

And our little session on decision trees wouldn't be complete without hearing from Ole about these things. 

[![IMAGE ALT TEXT HERE](https://img.youtube.com/vi/LAA_CnkAEx8/0.jpg)](https://www.youtube.com/watch?v=LAA_CnkAEx8)


*Decision tree "reading" 4*: And of course the best way to learn how to get this stuff rolling in practice, is to work through a tutorial or two. We recommend the ones below:
  * https://jakevdp.github.io/PythonDataScienceHandbook/05.08-random-forests.html
  * https://towardsdatascience.com/random-forest-in-python-24d0893d51c0 (this one also has good considerations regarding the one-hot encodings)
  
(But there are many other good ones out there.)

In [1]:
# # Ole explains decision trees
# YouTubeVideo("LAA_CnkAEx8",width=600, height=338)

> Exercises: Just a few questions to make sure you've read the text (DSFS chapter 17) and/or watched the video.
> 
> * There are two main kinds of decision trees depending on the type of output (numeric vs. categorical). What are they?
    Classification trees and regression trees.
> * Explain in your own words: Why is entropy useful when deciding where to split the data?
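    * Entropy measures how mixed the classes in a subset are. If the fraction of one class in a subset is close to 0 or 1, the entropy is low (the subset is nearly pure); if it is around 0.5, the entropy is high (the classes are evenly mixed). A good split is therefore one whose resulting subsets have low entropy, so we pick the split that reduces the entropy the most.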
> * Why are trees prone to overfitting? 
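    * Because a tree can keep splitting until every leaf is pure, at which point it has essentially memorized the training data, noise included. This can be limited by, for example, requiring a minimum number of samples before a node may be split, or by capping the depth of the tree.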
    
> * Explain (in your own words) how random forests help prevent overfitting.
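    * A random forest trains many trees, each on a random (bootstrap) sample of the data and with a random subset of the features considered at each split, and lets them vote. Each individual tree still overfits, but the trees overfit in different ways, so averaging them cancels out much of the variance and moves the model towards a better bias/variance tradeoff. This has a regularizing effect.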
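
To make the entropy question concrete, here is a minimal sketch in plain Python. The helper names `entropy` and `split_entropy` and the toy labels are made up for illustration; they are not taken from DSFS or from `sklearn`.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

def split_entropy(subsets):
    """Size-weighted average entropy of the subsets produced by a split."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * entropy(s) for s in subsets)

# Toy labels (hypothetical): four crimes of each category
labels = ["FRAUD"] * 4 + ["VEHICLE THEFT"] * 4

# A split that separates the classes perfectly vs. one that does not help at all
good_split = [["FRAUD"] * 4, ["VEHICLE THEFT"] * 4]
bad_split = [["FRAUD", "FRAUD", "VEHICLE THEFT", "VEHICLE THEFT"]] * 2

print(entropy(labels))            # 1.0: a 50/50 mix is maximally impure
print(split_entropy(good_split))  # 0.0: both subsets are pure
print(split_entropy(bad_split))   # 1.0: the split gained nothing
```

A tree-building algorithm would compare candidate splits in this way and keep the one with the lowest weighted entropy.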

## Part 2: Decision Tree Baseline


> *Exercise*: Decision trees and real-world crime data
> 
> The idea for today is to pick two crime-types that have *different geographical patterns* and *different temporal patterns*. We can then use various variables of the real crime data as categories to build a decision tree. I'm thinking we can use
> * `DayOfWeek` (`Sunday`, ..., `Saturday`). (Note: Will need to be encoded as an integer in `sklearn`)
> * `PD District` (`TENDERLOIN`, etc). (Note: Will need to be encoded as an integer in `sklearn`)
> 
> And we can extract a few more from the `Time` and `Date` variables
> * Hour of the day (1-24)
> * Month of the year (1-12)
> 
> So your job is to **select two crime categories** that (based on your analyses from the past three weeks) have different spatio-temporal patterns. Since we will use weather data to distinguish them later, let's try to think of crime categories that our intuition tells us might be strongly influenced by the weather conditions (type 1). And also think of other categories where we **don't** expect weather to play a role (type 2). We suggest:

* `BURGLARY or VEHICLE THEFT` for type 1. 
* `FORGERY/COUNTERFEITING or FRAUD` for type 2. 

But you are free to choose other ones, if you like 🤓

What we are going to build now is a decision tree (or, even better, a [Random Forest](https://jakevdp.github.io/PythonDataScienceHandbook/05.08-random-forests.html); here is [another tutorial for Random Forests](https://towardsdatascience.com/random-forest-in-python-24d0893d51c0)) classifier that takes as input the four features (Hour-of-the-day, Day-of-the-week, Month-of-the-year, and PD-District) of a crime (from one of the two categories) and then tries to predict which category that crime is from.
>
> Some notes/hints
> * Remember to create a balanced dataset, that is, **grab an equal number of examples** from each of the two crime categories. Pick categories with lots of training data. It's probably nice to have something like 10000+ examples of each category to train on. 
> * Also, I recommend you grab your training data at `random` from the set of all examples, since we want crimes to be distributed equally over time.
> * A good option is the  `DecisionTreeClassifier`.
> * We recommend you build a separate Pandas `DataFrame` for your training data, so the process of adding the weather data will be as smooth as possible later on. The same goes for your testing data.
> * Create a function to evaluate the accuracy of your classifier. Make sure your test data is not used for training. (Since you have created a balanced dataset, the baseline performance (random guess) is 50%. How good can your classifier get?)
> * (Optional, although this one might improve performance). Does one-hot encoding affect your results? Why/Why not?  
> * (Optional) Are your results tied to the specific training data you used? Are you overfitting? Try performing [cross-validation](https://towardsdatascience.com/cross-validation-in-machine-learning-72924a69872f) to answer this question.
> * (Optional) If you find yourself with extra time, come back to this exercise and tweak the variables you use to see if you can improve the accuracy of the tree. Try for example adding Year, Month or other variables you think may be relevant.
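
The hints above can be sketched end to end as follows. This is only one possible layout, run on synthetic stand-in data rather than the real crime file; the helper name `evaluate`, the column values, and settings such as `max_depth=5` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the crime data (hypothetical values, not the real dataset)
n = 3000
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
crimes = pd.DataFrame({
    "Category": rng.choice(["VEHICLE THEFT", "FRAUD"], size=n, p=[0.7, 0.3]),
    "Hour": rng.integers(0, 24, size=n),
    "Month": rng.integers(1, 13, size=n),
    "DayOfWeek": rng.choice(days, size=n),
    "PdDistrict": rng.choice(["TENDERLOIN", "MISSION", "PARK"], size=n),
})

# Balance the classes: sample the same number of rows from each category
n_per_class = crimes["Category"].value_counts().min()
balanced = crimes.groupby("Category").sample(n=n_per_class, random_state=0)

# One-hot encode the text columns; the numeric columns pass through unchanged
X = pd.get_dummies(balanced[["Hour", "Month", "DayOfWeek", "PdDistrict"]])
y = (balanced["Category"] == "VEHICLE THEFT").astype(int)

def evaluate(model, X_test, y_test):
    """Fraction of test rows whose category is predicted correctly."""
    return (model.predict(X_test) == y_test).mean()

# Hold out 25% of the rows so the test data is never used for training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)
print("held-out accuracy:", evaluate(tree, X_test, y_test))

# 5-fold cross-validation shows how much the score depends on the particular split
scores = cross_val_score(DecisionTreeClassifier(max_depth=5, random_state=0), X, y, cv=5)
print("cross-validated accuracy:", scores.mean(), "+/-", scores.std())
```

Because the synthetic features are pure noise, both scores should hover around the 50% baseline; on the real data the interesting question is how far above that baseline you can get.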

In [2]:
#Loading data
import pandas as pd
import numpy as np
df = pd.read_csv(r"C:\Users\andri\Documents\Andri Geir\DTU\Social Data\Police_Department_Incident_Reports__Historical_2003_to_May_2018.csv")

# Parse the dates once (they are stored as MM/DD/YYYY) and keep only the full years 2003-2017
dates = pd.to_datetime(df['Date'], format='%m/%d/%Y')
df = df[(dates >= '2003-01-01') & (dates <= '2017-12-31')].copy()
# Derive the time features used by the model (assignment aligns on the row index)
df['Year'] = dates.dt.year
df['Month'] = dates.dt.month
df['Day'] = dates.dt.day
df['Hour'] = df['Time'].str[:2].astype(int)


In [3]:
df_filtered1 = df[df["Category"]==("VEHICLE THEFT")].sample(40000)
df_filtered2 = df[df["Category"]==("FRAUD")].sample(40000)
print(len(df_filtered1))
print(len(df_filtered2))
df_balanced = pd.concat([df_filtered1,df_filtered2]).sample(frac=1).reset_index()
print(len(df_balanced))

40000
40000
80000


In [4]:
df_balanced

| | index | PdId | IncidntNum | Incident Code | Category | Descript | DayOfWeek | Date | Time | PdDistrict | ... | Areas of Vulnerability, 2016 2 2 | Central Market/Tenderloin Boundary 2 2 | Central Market/Tenderloin Boundary Polygon - Updated 2 2 | HSOC Zones as of 2018-06-05 2 2 | OWED Public Spaces 2 2 | Neighborhoods 2 | Year | Month | Day | Hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1187079 | 15108197707021 | 151081977 | 7021 | VEHICLE THEFT | STOLEN AUTOMOBILE | Tuesday | 12/15/2015 | 10:45 | RICHMOND | ... | 1.0 | | | | | 5.0 | 2015 | 12 | 15 | 10 |
| 1 | 1103969 | 13102687407021 | 131026874 | 7021 | VEHICLE THEFT | STOLEN AUTOMOBILE | Wednesday | 12/04/2013 | 16:00 | NORTHERN | ... | 1.0 | | | | | 15.0 | 2013 | 12 | 4 | 16 |
| 2 | 752762 | 11006813407021 | 110068134 | 7021 | VEHICLE THEFT | STOLEN AUTOMOBILE | Friday | 01/21/2011 | 17:30 | PARK | ... | 1.0 | | | | | 97.0 | 2011 | 1 | 21 | 17 |
| 3 | 1873745 | 6091038609320 | 60910386 | 9320 | FRAUD | CREDIT CARD, THEFT BY USE OF | Friday | 08/25/2006 | 20:45 | SOUTHERN | ... | 1.0 | | | | | 108.0 | 2006 | 8 | 25 | 20 |
| 4 | 1089519 | 5106614007025 | 51066140 | 7025 | VEHICLE THEFT | STOLEN TRUCK | Tuesday | 09/20/2005 | 10:30 | TARAVAL | ... | 1.0 | | | | | 109.0 | 2005 | 9 | 20 | 10 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 79995 | 104966 | 12037731707021 | 120377317 | 7021 | VEHICLE THEFT | STOLEN AUTOMOBILE | Saturday | 05/12/2012 | 00:01 | SOUTHERN | ... | 1.0 | | | | | 32.0 | 2012 | 5 | 12 | 0 |
| 79996 | 1911726 | 12014221707021 | 120142217 | 7021 | VEHICLE THEFT | STOLEN AUTOMOBILE | Monday | 02/20/2012 | 09:15 | INGLESIDE | ... | 1.0 | | | | | 90.0 | 2012 | 2 | 20 | 9 |
| 79997 | 1860148 | 13024151409320 | 130241514 | 9320 | FRAUD | CREDIT CARD, THEFT BY USE OF | Monday | 03/18/2013 | 00:01 | PARK | ... | 1.0 | | | | | 113.0 | 2013 | 3 | 18 | 0 |
| 79998 | 1730407 | 4084563407025 | 40845634 | 7025 | VEHICLE THEFT | STOLEN TRUCK | Saturday | 07/24/2004 | 23:30 | SOUTHERN | ... | 2.0 | 1.0 | 1.0 | 1.0 | | 32.0 | 2004 | 7 | 24 | 23 |


In [5]:
X = pd.get_dummies(df_balanced[["PdDistrict","Year","Month","Day","Hour"]])

y = df_balanced["Category"]

In [6]:
# Encode the target as integers: FRAUD -> 0, VEHICLE THEFT -> 1
y = (y == "VEHICLE THEFT").astype(int).to_numpy()

In [7]:
X[:0]

Unnamed: 0,Year,Month,Day,Hour,PdDistrict_BAYVIEW,PdDistrict_CENTRAL,PdDistrict_INGLESIDE,PdDistrict_MISSION,PdDistrict_NORTHERN,PdDistrict_PARK,PdDistrict_RICHMOND,PdDistrict_SOUTHERN,PdDistrict_TARAVAL,PdDistrict_TENDERLOIN


In [8]:
# Using scikit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split
# Split the data into training and testing sets
train_features, test_features, train_labels, test_labels = train_test_split(X, y, test_size = 0.25, random_state = 42)

In [9]:
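# Import the model we are using
from sklearn.ensemble import RandomForestRegressor
# Instantiate model with 1000 decision trees
# (a regressor fit on the 0/1 labels, so predict() returns a score between 0 and 1;
# RandomForestClassifier would be the more idiomatic choice for this task)
rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)
# Train the model on training data
rf.fit(train_features, train_labels);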

In [10]:
# Use the forest's predict method on the test data (scores between 0 and 1)
predictions = rf.predict(test_features)
# Calculate the absolute errors between the scores and the 0/1 labels
errors = abs(predictions - test_labels)
# Print out the mean absolute error (mae)
print('Mean Absolute Error:', round(np.mean(errors), 2))

Mean Absolute Error: 0.37


In [11]:

# Round each score to the nearest class (0 or 1) and compute the accuracy on the test set
accuracy = (predictions.round() == test_labels).mean()
accuracy

0.6802

## Part 3: Beyond the Baseline with Weather

In Part 2, you built a Decision Tree/Random Forest classifier to predict the category of a crime with the help of our friend `sklearn` using the following variables:

* `Hour of the week` (`1 , 2, ..., 168 `). 
* `PD District` (`TENDERLOIN`, etc). (**Remember**: you'll need to encode these labels as integers for `sklearn`. You can assign numbers to the labels with something like sklearn's `LabelEncoder`, or write your own custom function.)
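
As a rough sketch of how the two variables listed above could be derived, assuming the `Date`, `Time` and `PdDistrict` columns look like those in the crime data: the three sample rows are hypothetical, and counting the week from Monday is an arbitrary choice made here.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows in the same layout as the crime data
sample = pd.DataFrame({
    "Date": ["12/15/2015", "01/21/2011", "05/12/2012"],
    "Time": ["10:45", "17:30", "00:01"],
    "PdDistrict": ["RICHMOND", "PARK", "SOUTHERN"],
})

timestamps = pd.to_datetime(sample["Date"] + " " + sample["Time"], format="%m/%d/%Y %H:%M")

# dayofweek is 0 for Monday ... 6 for Sunday and hour is 0 ... 23, so this gives
# 1 for Monday 00:00-00:59 up to 168 for Sunday 23:00-23:59
sample["HourOfWeek"] = timestamps.dt.dayofweek * 24 + timestamps.dt.hour + 1

# LabelEncoder maps each distinct string to an integer, following the sorted order of the labels
encoder = LabelEncoder()
sample["PdDistrictCode"] = encoder.fit_transform(sample["PdDistrict"])

print(sample[["HourOfWeek", "PdDistrictCode"]])
print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))
```

Note that integer codes impose an arbitrary ordering on the districts. Tree-based models usually cope with that, but one-hot encoding (for example `pd.get_dummies`) is an alternative worth comparing.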

That model from Part 2 will function as our baseline. Now that we have it set up, we can use it to understand how adding variables from a **weather dataset** will influence the decisions of the tree later on.

Time to get that weather data rolling. The raw data we are using can be found online [here](https://www.meteoblue.com/en/weather/archive/export/san-francisco_united-states-of-america_5391959) or [you can get a convenient version from the files folder of our class repository by clicking here](https://raw.githubusercontent.com/suneman/socialdata2021/master/files/weather_data.csv). 

> *Exercise*
> 
> * Load the weather dataset. If you have your training data and test data on separate `DataFrames` then merging them with the weather information should be simple 
>   * **Hint**: you can use the join method from pandas. To do so, you will need to round the time to the hour because weather data is recorded hourly. Also it's fine to drop missing values. Here's a [stackoverflow post](https://stackoverflow.com/questions/36292959/pandas-merge-data-frames-on-datetime-index) which may help you. 
>   * *Note*: you'll need to do some encoding on the weather data as before if you want to use the `weather` column. Also, check whether all of the entries of the new training data actually have weather data attached to them. 
> * Now that you have the data properly merged, you can **fit a new random forest on the data and compare the results**. How does the weather data influence the prediction performance? (Use the evaluation function you built above.) Is there an impact on the accuracy of predictions? Is weather data relevant for the predictions?
> * *Optional*: Try experimenting with using only certain variables of the weather data. Can you improve the performance of classification by using fewer features/variables?
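
The merge step described in the exercise might look roughly like the sketch below. The two miniature frames are hypothetical stand-ins for the crime and weather data, and timezones are left out to keep the example short.

```python
import pandas as pd

# Hypothetical miniature versions of the two datasets
crimes = pd.DataFrame({
    "Category": ["FRAUD", "VEHICLE THEFT", "FRAUD"],
    "timestamp": pd.to_datetime(["2012-10-02 00:20", "2012-10-02 01:40", "2012-10-05 09:00"]),
})
weather = pd.DataFrame({
    "date": pd.to_datetime(["2012-10-02 00:00", "2012-10-02 01:00", "2012-10-02 02:00"]),
    "temperature": [16.32, 16.31, 16.30],
    "weather": ["sky is clear", "sky is clear", "light rain"],
})

# Weather is hourly, so round each crime to the nearest full hour before joining
crimes["date"] = crimes["timestamp"].dt.round("h")

# A left merge keeps every crime; indicator=True adds a "_merge" column
# that says whether a matching weather row was found
merged = crimes.merge(weather, on="date", how="left", indicator=True)
print(merged["_merge"].value_counts())

# Drop the crimes without weather, then one-hot encode the text column
merged = merged[merged["_merge"] == "both"].drop(columns="_merge")
features = pd.get_dummies(merged[["temperature", "weather"]])
print(features)
```

Checking the `_merge` counts before dropping rows is a quick way to see how many crimes fall outside the period covered by the weather data.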


**Note**: It's not 100% given that adding weather will improve your predictive performance. It can go either way depending on the details of your implementation. The important thing is not performance, but that you implement your code in the right way.


In [8]:
#df_weather = df = pd.read_csv(r"C:\Users\andri\Documents\Andri Geir\DTU\Social Data\weather_data.csv")
weather = pd.read_csv("weather_data.csv", parse_dates=["date"],
                date_parser=lambda x: pd.to_datetime(x).tz_convert(None).tz_localize("Etc/GMT+3").tz_convert("Etc/GMT-7")) 
# parse_dates lists the columns that contain dates, so they become datetime columns instead of string columns
# date_parser -> we specify our custom date parser (pandas has a default one, so usually we do not need to specify it)
# in our date_parser we use a "lambda" function, which is applied to the values of the date column
# pd.to_datetime(x) converts each value to a datetime object. The parsed values carry a GMT+0 timezone,
# which is wrong, so we strip the timezone with tz_convert(None)
# now we want to specify the timezone the timestamps were recorded in -> we use tz_localize("..")
# afterwards we convert the dates to the timezone used for the crime timestamps with tz_convert("..")
# (note: the sign in IANA "Etc/" zone names is inverted, so "Etc/GMT-7" is UTC+7, hence the +07:00 offsets in the output)
weather.head()

| | date | temperature | humidity | weather | wind_speed | wind_direction | pressure |
|---|---|---|---|---|---|---|---|
| 0 | 2012-10-01 23:00:00+07:00 | 16.33 | 88.0 | light rain | 2.0 | 150.0 | 1009.0 |
| 1 | 2012-10-02 00:00:00+07:00 | 16.324993 | 87.0 | sky is clear | 2.0 | 147.0 | 1009.0 |
| 2 | 2012-10-02 01:00:00+07:00 | 16.310618 | 86.0 | sky is clear | 2.0 | 141.0 | 1009.0 |
| 3 | 2012-10-02 02:00:00+07:00 | 16.296243 | 85.0 | sky is clear | 2.0 | 135.0 | 1009.0 |
| 4 | 2012-10-02 03:00:00+07:00 | 16.281869 | 84.0 | sky is clear | 2.0 | 129.0 | 1009.0 |


In [20]:
df = pd.read_csv(r"C:\Users\andri\Documents\Andri Geir\DTU\Social Data\Police_Department_Incident_Reports__Historical_2003_to_May_2018.csv", usecols=["Category", "Date", "Time", "PdDistrict"]) ## specify any columns you need
df = df[df["Category"].isin(['VEHICLE THEFT', 'FRAUD'])].copy() # filter the dataframe, you can plug in any list of crimes
# Here we do a slightly more complicated thing:
# we concatenate the Date and Time columns into one string per row, which can then be converted to datetime.
# We also want to remove any seconds and minutes (round to the nearest hour, since the weather data is hourly),
# then we specify that the dates are in "Etc/GMT-7", the same zone the weather timestamps were converted to.
# The result is stored in a new "datetime" column.
df["datetime"] = pd.to_datetime(df["Date"] + " " + df["Time"]).dt.round("h").dt.tz_localize("Etc/GMT-7")
df['Year'] = pd.DatetimeIndex(df['Date']).year
df['Month'] = pd.DatetimeIndex(df['Date']).month
df['Day'] = pd.DatetimeIndex(df['Date']).day
df['Hour'] = df['Time'].str[:2].astype(int)

# now you  can merge two datasets
# df_balanced["datetime"]

df


In [25]:
df_weather = pd.merge(df,weather,how="left",left_on ="datetime",right_on ="date")
df_weather = df_weather.dropna(subset=["date"])

In [31]:
len(df_weather[df_weather["Category"]==("VEHICLE THEFT")])
len(df_weather[df_weather["Category"]==("FRAUD")])

14301

In [32]:
df_filtered1 = df_weather[df_weather["Category"]==("VEHICLE THEFT")].sample(14301)
df_filtered2 = df_weather[df_weather["Category"]==("FRAUD")].sample(14301)
print(len(df_filtered1))
print(len(df_filtered2))
df_balanced = pd.concat([df_filtered1,df_filtered2]).sample(frac=1).reset_index()
print(len(df_balanced))

14301
14301
28602


In [36]:
df_balanced.columns

Index(['index', 'Category', 'Date', 'Time', 'PdDistrict', 'datetime', 'Year',
       'Month', 'Day', 'Hour', 'date', 'temperature', 'humidity', 'weather',
       'wind_speed', 'wind_direction', 'pressure'],
      dtype='object')

In [37]:
X = pd.get_dummies(df_balanced[['PdDistrict', 'Year',
       'Month', 'Day', 'Hour', 'temperature', 'humidity', 'weather',
       'wind_speed', 'wind_direction', 'pressure']])

# Encode the target as integers: FRAUD -> 0, VEHICLE THEFT -> 1
y = (df_balanced["Category"] == "VEHICLE THEFT").astype(int).to_numpy()

In [38]:
# Using scikit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split
# Split the data into training and testing sets
train_features, test_features, train_labels, test_labels = train_test_split(X, y, test_size = 0.25, random_state = 42)

In [39]:
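# Import the model we are using
from sklearn.ensemble import RandomForestRegressor
# Instantiate model with 1000 decision trees
# (a regressor fit on the 0/1 labels, so predict() returns a score between 0 and 1;
# RandomForestClassifier would be the more idiomatic choice for this task)
rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)
# Train the model on training data
rf.fit(train_features, train_labels);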

In [40]:
# Use the forest's predict method on the test data (scores between 0 and 1)
predictions = rf.predict(test_features)
# Calculate the absolute errors between the scores and the 0/1 labels
errors = abs(predictions - test_labels)
# Print out the mean absolute error (mae)
print('Mean Absolute Error:', round(np.mean(errors), 2))
# Round each score to the nearest class (0 or 1) and compute the accuracy on the test set
accuracy = (predictions.round() == test_labels).mean()
accuracy

Mean Absolute Error: 0.4


0.6786463431687876

## Part 4: Video Lectures and Reading

Next week we'll be playing around with *explanatory data visualization*. Roughly speaking this means using data visualization to communicate your results to others. Thus, there are new things to think about. We'll start thinking about that already this week.

We start with a video from yours truly and then read a bit from a scientific article about types of explanatory dataviz. (*The video is from an old version of the class that used D3, so just ignore those parts. I'll make a new one ASAP*).

[![IMAGE ALT TEXT HERE](https://img.youtube.com/vi/yHKYMGwefso/0.jpg)](https://www.youtube.com/watch?v=yHKYMGwefso)

In [None]:
# # Sune talks about designing visualizations.
# from IPython.display import YouTubeVideo
# YouTubeVideo("yHKYMGwefso",width=600, height=338)

> *Exercises*: Explanatory data visualization
> * What are the three key elements to keep in mind when you design an explanatory visualization?
> * In the video I talk about (1) *overview first*,  (2) *zoom and filter*,  (3) *details on demand*. 
>   - Go online and find a visualization that follows these principles (don't use one from the video). 
>   - Explain how it achieves (1)-(3). It might be useful to use screenshots to illustrate your explanation.
> * Explain in your own words: How is explanatory data analysis different from exploratory data analysis?

*Reading*: [Narrative Visualization: Telling Stories with Data](http://vis.stanford.edu/files/2010-Narrative-InfoVis.pdf) by Edward Segel and Jeffrey Heer. We'll read sections 1-3 today. (And the rest next time).

When you get to section 3 it's fun to open up the examples mentioned by the authors in a browser and explore them as you read the text. 

> *Exercise*: Answer a couple of questions about the paper.
> 
> * What is the *Oxford English Dictionary's* definition of a narrative?
> * What is your favorite visualization among the examples in section 3? Explain why in a few words.