# Assignment 2.

## Formalia:

Please read the [assignment overview page](https://github.com/suneman/socialdata2021/wiki/Assignment-1-and-2) carefully before proceeding. This page contains information about formatting (including formats etc), group sizes, and many other aspects of handing in the assignment. 

_If you fail to follow these simple instructions, it will negatively impact your grade!_

**Due date and time**: The assignment is due on Monday April 5th, 2021 at 23:55. Hand in your files via [`http://peergrade.io`](http://peergrade.io/).

**Peergrading date and time**: _Remember that after handing in you have a week to evaluate a few assignments written by other members of the class_. Thus, the peer evaluations are due on Monday April 12th, 2021 at 23:55. 

## Part 1: Questions to text and lectures.

A) Please answer my questions to the Segel and Heer paper we read during lectures 7 and 8.

* What is the *Oxford English Dictionary's* definition of a narrative?
* What is your favorite visualization among the examples in section 3? Explain why in a few words.
* What's the point of Figure 7?
* Use Figure 7 to find the most common design choice within each category for the Visual narrative and Narrative structure (the categories within visual narrative are 'visual structuring', 'highlighting', etc).
* Check out Figure 8 and section 4.3. What is your favorite genre of narrative visualization? Why? What is your least favorite genre? Why?


B) Also please answer the questions to my talk on [explanatory data visualization](https://www.youtube.com/watch?v=yHKYMGwefso)

* What are the three key elements to keep in mind when you design an explanatory visualization?
* In the video I talk about (1) *overview first*,  (2) *zoom and filter*,  (3) *details on demand*. 
  - Go online and find a visualization that follows these principles (don't use one from the video). 
  - Explain how it achieves (1)-(3). It might be useful to use screenshots to illustrate your explanation.
* Explain in your own words: How is explanatory data analysis different from exploratory data analysis?

## Part 2: Random forest and weather

The aim here is to recreate the work you did in Part 1-3 of the Week 7 lecture. I've phrased things differently relative to the exercise to make the purpose more clear. 

Part 2A: Random forest binary classification. 
* Using the instructions and material from Week 7, build a *random forest* classifier to distinguish between two types (you choose) of crime based on spatio-temporal (where/when) features of data describing the two crimes. When you're done, you should be able to give the classifier a place and a time, and it should tell you which of the two types of crime happened there.
  - Explain your choices for training/test data, features, and encoding. (You decide how to present your results, but here are some example topics to consider: Did you balance the training data? What are the pros/cons of balancing? Do you think your model is overfitting? Did you choose to do cross-validation? Which specific features did you end up using? Why? Which features (if any) did you one-hot encode? Why ... or why not?)
  - Report accuracy. Discuss the model performance.
  
  
Part 2B: Info from weather features.
* Now add features from weather data to your random forest. 
  - Report accuracy. 
  - Discuss how the model performance changes relative to the version with no weather data.
  - Discuss what you have learned about crime from including weather data in your model.

In [14]:
import pandas as pd
import matplotlib.pyplot as plt

crimes = pd.read_csv("../incidents.csv") 

In [42]:
import numpy as np 
from datetime import datetime

# Burglary samples: 91067,  Fraud samples: 41348

focuscrimes = ["BURGLARY", "FRAUD"]

crimes = crimes[crimes["Category"].isin(focuscrimes)]

crimes["Date_Time"] = pd.to_datetime(crimes["Date"] + " " + crimes["Time"])
crimes["Year"] = crimes["Date_Time"].dt.year
crimes["Month"] = crimes["Date_Time"].dt.month
crimes["Hour"] = crimes["Date_Time"].dt.hour

In [76]:
# between() includes both endpoints by default; the boolean form inclusive=True is deprecated in newer pandas
crimes_in_range = crimes[crimes["Year"].between(2012, 2017)]
burglary = crimes_in_range[crimes_in_range["Category"].isin([focuscrimes[0]])]
fraud = crimes_in_range[crimes_in_range["Category"].isin([focuscrimes[1]])]

print(burglary.shape)
print(fraud.shape)

crimes_in_range.head()

(35912, 39)
(16703, 39)


| | PdId | IncidntNum | Incident Code | Category | Descript | DayOfWeek | Date | Time | PdDistrict | Resolution | ... | Areas of Vulnerability, 2016 2 2 | Central Market/Tenderloin Boundary 2 2 | Central Market/Tenderloin Boundary Polygon - Updated 2 2 | HSOC Zones as of 2018-06-05 2 2 | OWED Public Spaces 2 2 | Neighborhoods 2 | Date_Time | Year | Month | Hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 13085582009320 | 130855820 | 9320 | FRAUD | CREDIT CARD, THEFT BY USE OF | Tuesday | 10/08/2013 | 21:11 | PARK | NONE | ... | 1.0 | | | | | 13.0 | 2013-10-08 21:11:00 | 2013 | 10 | 21 |
| 68 | 13007111705041 | 130071117 | 5041 | BURGLARY | BURGLARY OF RESIDENCE, FORCIBLE ENTRY | Friday | 01/25/2013 | 07:45 | PARK | NONE | ... | 1.0 | | | | | 112.0 | 2013-01-25 07:45:00 | 2013 | 1 | 7 |
| 93 | 12020159005073 | 120201590 | 5073 | BURGLARY | BURGLARY, UNLAWFUL ENTRY | Monday | 03/05/2012 | 10:34 | RICHMOND | NONE | ... | 2.0 | | | | | 8.0 | 2012-03-05 10:34:00 | 2012 | 3 | 10 |
| 133 | 13083397305071 | 130833973 | 5071 | BURGLARY | BURGLARY, FORCIBLE ENTRY | Tuesday | 10/01/2013 | 08:00 | NORTHERN | NONE | ... | 1.0 | | | | | 102.0 | 2013-10-01 08:00:00 | 2013 | 10 | 8 |
| 162 | 13090015205013 | 130900152 | 5013 | BURGLARY | BURGLARY OF APARTMENT HOUSE, UNLAWFUL ENTRY | Tuesday | 10/22/2013 | 15:35 | NORTHERN | ARREST, BOOKED | ... | 1.0 | | | | | 105.0 | 2013-10-22 15:35:00 | 2013 | 10 | 15 |


In [81]:
sample_size = 15000

# Create balanced data set
type1 = burglary.sample(sample_size)
type2 = fraud.sample(sample_size)

crime_df = pd.concat([type1, type2], ignore_index=True)

In [82]:
# .copy() gives an independent DataFrame, so assigning to a column below
# does not raise pandas' SettingWithCopyWarning
features = crime_df[["Category", "DayOfWeek", "Month", "Hour", "PdDistrict"]].copy()

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
features["Category"] = le.fit_transform(features["Category"])

# One-hot encode the categorical data
features = pd.get_dummies(features, columns=["DayOfWeek", "PdDistrict"])

# Labels will be the values we want to predict
labels = np.array(features["Category"])

# We remove the labels from the crime dataframe to get all the values we need for the features
features = features.drop('Category', axis=1)

# We save the feature names for later
feature_list = list(features.columns)

# Convert the dataframe to a numpy array so we can work with the features
features = np.array(features)

# Using scikit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = 0.25, random_state = 42)

print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)

Training Features Shape: (22500, 19)
Training Labels Shape: (22500,)
Testing Features Shape: (7500, 19)
Testing Labels Shape: (7500,)


In [57]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_absolute_error

no_max_classifier = RandomForestClassifier(n_estimators=99, random_state=42)
no_max_classifier.fit(train_features, train_labels)

classifier = RandomForestClassifier(n_estimators=99, random_state=42, max_depth=3)
classifier.fit(train_features, train_labels)

print("No Max Average Tree Depth:", np.mean([estimator.get_depth() for estimator in no_max_classifier.estimators_]))

# For 0/1 labels the mean absolute error equals the misclassification rate (1 - accuracy)
no_max_train_predictions = no_max_classifier.predict(train_features)
print("Training")
print('Mean Absolute Error:', round(mean_absolute_error(train_labels, no_max_train_predictions), 2))
print("Accuracy: ", 100 * no_max_classifier.score(train_features, train_labels), "%\n")

no_max_predictions = no_max_classifier.predict(test_features)
print("Testing")
print('Mean Absolute Error:', round(mean_absolute_error(test_labels, no_max_predictions), 2))
print("Accuracy: ", 100 * no_max_classifier.score(test_features, test_labels), "%\n")


print("Max Depth 3")

train_predictions = classifier.predict(train_features)
print("Training")
print('Mean Absolute Error:', round(mean_absolute_error(train_labels, train_predictions), 2))
print("Accuracy: ", 100 * classifier.score(train_features, train_labels), "%\n")

predictions = classifier.predict(test_features)
print("Testing")
print('Mean Absolute Error:', round(mean_absolute_error(test_labels, predictions), 2))
print("Accuracy: ", 100 * classifier.score(test_features, test_labels), "%")

No Max Average Tree Depth: 29.96969696969697
Training
Mean Absolute Error: 0.17
Accuracy:  82.87555555555556 %

Testing
Mean Absolute Error: 0.41
Accuracy:  59.17333333333333 %

Max Depth 3
Training
Mean Absolute Error: 0.4
Accuracy:  59.96444444444444 %

Testing
Mean Absolute Error: 0.4
Accuracy:  60.160000000000004 %


## Part 2A

**Did you balance the training data? What are the pros/cons of balancing?**

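The dataset is balanced by drawing 15000 random samples from each crime category (`sample_size = 15000`), so that neither crime is favored over the other; sampling at random keeps both crimes spread over the whole time range. The advantage is that the 50% baseline is meaningful and the classifier cannot score well simply by always predicting the more common crime (burglary). The disadvantage is that many burglary records are discarded, and the model no longer reflects how often each crime actually occurs.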
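
As a minimal sketch of the balancing step, assuming a table with a `Category` column: the toy data and the helper name `balance_classes` are made up for illustration and are not part of the notebook. Passing a fixed `random_state` makes the random draw repeatable between runs.

```python
import pandas as pd


def balance_classes(df, label_col, n_per_class, random_state=42):
    """Draw the same number of rows from every class (hypothetical helper)."""
    parts = [
        group.sample(n=n_per_class, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts, ignore_index=True)


# Toy, imbalanced data standing in for the crime records (90 vs. 40 rows)
toy = pd.DataFrame({
    "Category": ["BURGLARY"] * 90 + ["FRAUD"] * 40,
    "Hour": list(range(24)) * 5 + list(range(10)),
})

balanced = balance_classes(toy, "Category", n_per_class=30)
print(balanced["Category"].value_counts())
```

Every class ends up with the same number of rows, at the cost of dropping rows from the larger class.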

**Do you think your model is overfitting?**

Without a maximum depth, the classifier reaches about 83% training accuracy but only about 59% test accuracy. Together with an average tree depth of about 30 on a dataset with 19 features, it seems safe to assume that the model is overfitting: it clearly does not generalize well from the training data to the testing data.

With a maximum depth of 3, training and test accuracy are both about 60%. The test accuracy is slightly higher than before even though the trees are drastically smaller, which indicates a better-fitted model.
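
One way to check for overfitting is to compare training and test accuracy for several depth limits. The sketch below does this on synthetic data (the features, labels, and depth values are assumptions for illustration, not the crime data); a large gap between the two scores suggests that the trees memorize the training set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in: 2000 rows, 5 features, only feature 0 carries a weak signal
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + rng.normal(scale=2.0, size=2000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

results = {}
for depth in [3, 10, None]:  # None means the trees grow without a depth limit
    model = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    results[depth] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"max_depth={depth}: train={results[depth][0]:.3f}, test={results[depth][1]:.3f}")
```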

**Did you choose to do cross-validation?**

No k-fold cross-validation was performed. To estimate the error of the classifier, the holdout method is used instead: the data is split once into a training set (75%) and a testing/validation set (25%), and the testing set is used to calculate the mean accuracy of the classifier.
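
A holdout estimate depends on one particular split. As a hedged sketch of the alternative, k-fold cross-validation with scikit-learn's `cross_val_score` could look like this; the synthetic data and the choice of 5 folds are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
# Synthetic stand-in for the encoded crime features and labels
X = rng.normal(size=(600, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=42)

# Each of the 5 folds serves once as the test set while the other 4 are used for training
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("Fold accuracies:", np.round(scores, 3))
print(f"Mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```

The spread across folds gives a rough idea of how much a single holdout score could vary by chance.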

**Which specific features did you end up using? Why?**

The features used are "DayOfWeek", "Month", "Hour", and "PdDistrict", because they describe the time and place of the crime. "Month" and "Hour" are derived from the "Date" and "Time" columns.

**Which features (if any) did you one-hot encode? Why ... or why not?**

The one-hot encoded features are "DayOfWeek" and "PdDistrict", whereas the crime category is only label encoded. Both "DayOfWeek" and "PdDistrict" are categorical variables that should be converted to binary columns, which the model can use without preferring one value over another; this is why Pandas' get_dummies function is used.

Because the crime category is what should be predicted, it is not converted to binary columns but is given a numeric representation using Sklearn's LabelEncoder.
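
The difference between the two encodings can be seen on a toy example (the three rows below are made up; the column names follow the notebook):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "Category": ["FRAUD", "BURGLARY", "FRAUD"],
    "PdDistrict": ["PARK", "RICHMOND", "NORTHERN"],
})

# Target: one integer per class (LabelEncoder sorts the class names)
le = LabelEncoder()
y = le.fit_transform(toy["Category"])
print(le.classes_.tolist(), y.tolist())

# Feature: one 0/1 column per district, so no artificial ordering is implied
X = pd.get_dummies(toy[["PdDistrict"]], columns=["PdDistrict"], dtype=int)
print(X.columns.tolist())
print(X.to_numpy())
```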

The "Date" and "Time" columns are included indirectly. When the raw crime data is loaded into a dataframe, both columns are treated as plain strings. We want to use them to determine how time influences the crimes, so the columns are merged into a datetime column "Date_Time".
The first approach was to convert the dates to their ordinal numeric values. That seemed smart at the time, but splitting the datetime into calendar components (year, month, day, hour, minute) is the better choice, and the code above extracts "Year", "Month", and "Hour" from "Date_Time". Humans, and therefore crimes, follow the patterns of the Gregorian calendar rather than those of a UNIX timestamp.
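
A small sketch of the two date encodings discussed above, on two example rows in the same `Date` and `Time` string format as the crime data (the toy dataframe itself is an assumption for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "Date": ["10/08/2013", "01/25/2013"],
    "Time": ["21:11", "07:45"],
})
stamp = pd.to_datetime(toy["Date"] + " " + toy["Time"], format="%m/%d/%Y %H:%M")

# Option 1: a single ordinal day number; the time of day is lost entirely
toy["Ordinal"] = stamp.apply(lambda t: t.toordinal())

# Option 2: calendar components that a tree can split on directly
toy["Year"] = stamp.dt.year
toy["Month"] = stamp.dt.month
toy["Day"] = stamp.dt.day
toy["Hour"] = stamp.dt.hour
toy["Minute"] = stamp.dt.minute

print(toy)
```

With the ordinal number, "every evening" or "every December" is not a simple threshold, whereas `Hour` and `Month` expose those recurring patterns directly.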

**Report accuracy. Discuss the model performance.**

The test accuracy of the Random Forest classifier is around 60% (59.2% without a depth limit and 60.2% with a maximum depth of 3), about 10 percentage points above the 50% baseline of the balanced dataset. This is probably not good enough for any practical application.
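
To put an accuracy figure into context, it can be compared with a trivial baseline and broken down with a confusion matrix. The sketch below uses synthetic data (an assumption for illustration, so the numbers say nothing about the crime data). It also shows that for 0/1 labels the mean absolute error is simply the error rate, 1 minus the accuracy.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
# Synthetic stand-in: feature 0 carries a noisy signal about the label
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + rng.normal(scale=1.0, size=1000) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Baseline: always predict the most frequent training class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=42)
forest.fit(X_train, y_train)

forest_pred = forest.predict(X_test)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
forest_acc = accuracy_score(y_test, forest_pred)
mae = mean_absolute_error(y_test, forest_pred)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, forest_pred)

print(f"Baseline accuracy: {baseline_acc:.3f}")
print(f"Forest accuracy:   {forest_acc:.3f} (MAE = {mae:.3f})")
print(cm)
```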

In [70]:
weather = pd.read_csv("../weather_data.csv")

# Format date and time for easy processing and training
weather["Date_Time"] = pd.to_datetime(weather["date"])
weather["Year"] = weather["Date_Time"].dt.year
weather["Month"] = weather["Date_Time"].dt.month
weather["Hour"] = weather["Date_Time"].dt.hour

# Fix naming
weather["Temperature"] = weather["temperature"]
weather["Humidity"] = weather["humidity"]
weather["Wind_Direction"] = weather["wind_direction"]
weather["Wind_Speed"] = weather["wind_speed"]
weather["Weather"] = weather["weather"]
weather["Pressure"] = weather["pressure"]

# Drop the columns we don't need
weather = weather.drop(["Date_Time", "temperature", "weather", "pressure", "humidity", "wind_direction", "wind_speed", "date"], axis=1)

# One-hot encode the categorical data
weather_df = pd.get_dummies(weather, columns=["Weather"])

In [71]:
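# Let's merge the weather and crime dataframes together!
# Caution: "Month" and "Hour" alone do not identify a single weather reading, so every crime
# is matched with all weather rows that share its month and hour (see the row counts below).

merged = pd.merge(crime_df, weather_df, how="left", on=["Month", "Hour"])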

In [74]:
print("Crimes: ", crime_df.shape)
print("Merged: ", merged.shape)
merged.head()

Crimes:  (30000, 19)
Merged:  (4619871, 52)


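| | Month | Hour | DayOfWeek_Friday | DayOfWeek_Monday | DayOfWeek_Saturday | DayOfWeek_Sunday | DayOfWeek_Thursday | DayOfWeek_Tuesday | DayOfWeek_Wednesday | PdDistrict_BAYVIEW | ... | Weather_scattered clouds | Weather_shower rain | Weather_sky is clear | Weather_smoke | Weather_squalls | Weather_thunderstorm | Weather_thunderstorm with heavy rain | Weather_thunderstorm with light rain | Weather_thunderstorm with rain | Weather_very heavy rain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |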
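
The row count grows from 30000 to 4619871 because `Month` and `Hour` do not identify a single weather observation. A possible alternative, sketched here on made-up toy data, is to join on the timestamp rounded down to the hour. The column name `Hour_Key` and the toy temperatures are assumptions; the sketch also assumes the weather file has one row per hour, which `validate="many_to_one"` checks.

```python
import pandas as pd

crimes_toy = pd.DataFrame({
    "Category": ["FRAUD", "BURGLARY", "BURGLARY"],
    "Date_Time": pd.to_datetime(
        ["2013-10-08 21:11", "2013-01-25 07:45", "2013-10-08 21:40"]
    ),
})
weather_toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2013-10-08 21:00", "2013-01-25 07:00", "2014-10-08 21:00"]
    ),
    "Temperature": [15.0, 9.0, 17.0],  # made-up values
})

# Naive join on month and hour: rows from other days and years match as well
naive = crimes_toy.assign(
    Month=crimes_toy["Date_Time"].dt.month, Hour=crimes_toy["Date_Time"].dt.hour
).merge(
    weather_toy.assign(
        Month=weather_toy["date"].dt.month, Hour=weather_toy["date"].dt.hour
    ),
    how="left",
    on=["Month", "Hour"],
)

# Join key: the crime timestamp rounded down to the full hour
crimes_toy["Hour_Key"] = crimes_toy["Date_Time"].dt.floor("h")
weather_keyed = weather_toy.rename(columns={"date": "Hour_Key"})

merged_toy = crimes_toy.merge(
    weather_keyed, how="left", on="Hour_Key", validate="many_to_one"
)

print("Crimes:", len(crimes_toy), "Naive:", len(naive), "Hourly key:", len(merged_toy))
```

With the hourly key, the merged table keeps exactly one row per crime, so the balanced class counts are preserved.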


## Part 3: Data visualization

* Create the Bokeh visualization from Part 2 of the Week 8 Lecture, displayed in a beautiful `.gif` below. 
* Provide nice comments for your code. Don't just use the `# inline comments`, but the full Notebook markdown capabilities and explain what you're doing.

![Movie](https://github.com/suneman/socialdataanalysis2020/blob/master/files/week8_1.gif?raw=true "movie")