# 1. Motivation
## 1.1 Dataset
The main dataset was the NYC Street Tree data from 2015, with secondary datasets from previous years (1995 and 2005). The street tree dataset was chosen because it is a rather unusual topic that could give new insights and perspectives on urban planning, ones that most people would probably not be aware of beforehand.

## 1.2 Goal
The goal was to enlighten users about trees in NYC. Are some trees more suitable for streets than others? Where are they located? Is it possible to know which kind of tree you might encounter based on the location, the health of the tree, its diameter, or even the number of problems it has? From this project it should be possible to learn something new about a topic you might never have considered learning about.

# 2. Basic stats
## 2.1 Preprocessing the data
### 2.1.1 Variable selection
At first glance the dataset was a bit overwhelming. Many of the variables were neither interesting nor necessary. Each variable was carefully examined, and the variables deemed unnecessary were excluded. Among these was "Tree_Id", a unique ID for each tree. This ID was only unique within each of the three datasets (1995, 2005, 2015), so it was not possible to join the datasets on it, which made it irrelevant. Most of the address information was also excluded, since multiple variables described the address at different levels and it was not relevant to distinguish between all of them.

### 2.1.2 Observation selection
It was decided to focus only on the top 20 tree species. There were many species without a significant number of observations, and it would be difficult to describe them all properly. It would also be very difficult to make good predictions for species with sparse observations.

There were a lot of trees without a species listed, and those were disregarded completely. The dead trees were also excluded from the dataset. 
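
The filtering steps described above can be sketched in a few lines. This is a minimal illustration on made-up records, not the project's actual preprocessing code: the field names `species` and `status`, the toy values, and the order of the steps (dropping unusable trees before counting species) are all assumptions.

```python
from collections import Counter

# Toy records (made up). The field names "species" and "status" are
# illustrative assumptions, not necessarily the real column names.
trees = [
    {"species": "London planetree", "status": "Alive"},
    {"species": "London planetree", "status": "Alive"},
    {"species": "London planetree", "status": "Alive"},
    {"species": "honeylocust", "status": "Alive"},
    {"species": "honeylocust", "status": "Alive"},
    {"species": "honeylocust", "status": "Dead"},
    {"species": None, "status": "Alive"},
    {"species": "ginkgo", "status": "Alive"},
]

def select_observations(records, top_n):
    """Drop dead or unlabeled trees, then keep only the top_n species."""
    usable = [r for r in records if r["species"] and r["status"] != "Dead"]
    counts = Counter(r["species"] for r in usable)
    keep = {name for name, _ in counts.most_common(top_n)}
    return [r for r in usable if r["species"] in keep]

# The report uses the top 20 species; the toy data only has room for 2.
selected = select_observations(trees, top_n=2)
print(Counter(r["species"] for r in selected))
```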

It was considered to focus on only one of the five boroughs in NYC to get a more detailed view. This was not implemented, since it was deemed more interesting to show differences between the boroughs as well.


## 2.2 Stats for the preprocessed data
The final dataset "Street Tree Data 2015" consists of 534,514 tree observations and 21 variables/features, totalling 74.5 MB.
The selected features were: 
* Diameter (inches)
* Health (three values: Good, fair, poor)
* Spc_Latin, Spc_Common (Latin and common name for the species)
* Sidewalk_Condition (two values: Damage, NoDamage)
* problems (a string concatenated from the following types of problems)
  * root_stone, root_grate, root_other, trunk_wire, trnk_light, trnk_other, brch_light, brch_shoe, brch_other (two values: yes/no)
* Address
* Zipcode
* CB (community board)
* Borough
* Latitude, Longitude

Number of trees in each borough:
- Bronx: 63,035
- Brooklyn: 138,760
- Staten Island: 82,619
- Manhattan: 54,115
- Queens: 195,985
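
As a quick consistency check, the borough counts listed above can be summed and compared with the total number of observations. The snippet only uses the numbers stated in this section; the variable names are illustrative.

```python
# Borough counts as listed in this section.
borough_counts = {
    "Bronx": 63_035,
    "Brooklyn": 138_760,
    "Staten Island": 82_619,
    "Manhattan": 54_115,
    "Queens": 195_985,
}

total = sum(borough_counts.values())
print(total)  # 534514, matching the 534,514 observations stated above

# Share of all trees per borough, largest first.
for borough, count in sorted(borough_counts.items(), key=lambda kv: -kv[1]):
    print(f"{borough}: {count / total:.1%}")
```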

In general, the top 20 species were the same for the 5 boroughs, but the order of this "top 20" list differed. There were more trees with problems in Manhattan.

### 2.2.1 More observations in the data 
### <font color='red'> DANIELE: WRITE SOMETHING ABOUT WHAT YOU FOUND</font>

## 2.3 Other datasets inspected
Multiple secondary datasets were inspected, e.g. the 311 dataset, which contains several complaints about trees in NYC. No significant correlations were found. It was hoped that certain types of complaints would be correlated with different problems or with the health of the tree, but unfortunately data does not always behave as hoped or suspected, and patterns cannot (and should not) be forced to appear.

One could also wonder whether "greener" areas, meaning areas with a lot of trees, have higher house prices. After investigation this proved challenging to answer, since little information about house prices is available, at least not at the neighbourhood level.

It was also considered whether there was a correlation between the trees (or their features) and air pollution. The air pollution dataset was used for simple linear regression.


# 3. Theory

## 3.1 Machine Learning tools
When doing predictions it can be difficult to find the appropriate tools to use. Different tools have different qualities and it all depends on the data and the patterns in your data. In this project, different tools have been tried out, typically multiple tools for the same prediction to inspect the model performance of each tool. 

### 3.1.1 KNN
KNN (k-nearest neighbours) is rather easy to grasp and implement. It was chosen for predicting the health of a tree based on GPS coordinates, as well as for predicting species based on GPS coordinates. An argument for KNN being the most appropriate choice is that, when planting trees, one would be inclined to plant the same species together. One could also expect unhealthy trees to be found in the same area, presumably because of a disease in the area, a pollution problem, soil problems or something completely different. A drawback of KNN is that on an unbalanced dataset it will favour the most frequent class.
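
To make the idea concrete, here is a minimal sketch of KNN classification on coordinates, written with the standard library only. It is not the project's implementation: the function name `knn_predict`, the toy coordinates and the labels are made up, and plain Euclidean distance on raw latitude/longitude is a simplifying assumption.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training points closest to `query`.

    `train` is a list of ((lat, lon), label) pairs. Euclidean distance on
    raw degrees is a simplification that is only reasonable over a small
    area such as a single city.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up): a cluster of "Good" trees and a smaller "Poor" cluster.
train = [
    ((40.700, -73.900), "Good"),
    ((40.701, -73.901), "Good"),
    ((40.702, -73.899), "Good"),
    ((40.703, -73.902), "Good"),
    ((40.750, -73.950), "Poor"),
    ((40.751, -73.951), "Poor"),
]

query = (40.7505, -73.9505)  # sits inside the "Poor" cluster
print(knn_predict(train, query, k=1))
print(knn_predict(train, query, k=5))
```

With `k=1` the query is labeled by its nearest neighbour in the "Poor" cluster, while with `k=5` the three more distant "Good" trees outvote the two "Poor" ones. This is the imbalance effect mentioned above in miniature.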

### 3.1.2 Decision trees
Decision trees can often be a good choice because they are easy to visualize. A drawback is that they tend to overfit the training data. Despite this drawback, they were used for predicting health based on GPS coordinates, as well as species based on GPS coordinates. When predicting species, additional features such as the diameter were added to see if they contributed to the predictions. The main reason for using decision trees was to compare with the other results: if the decision trees did not overfit and still performed well, they would be nice to visualize. To mitigate the overfitting issue, a random forest was also tried out.

Decision trees were also used to predict diameter based on species and problems, as well as diameter based on the number of problems. Here, the diameter was grouped into bins of different sizes (1-10, 10-15, 15-20, ..., 45-50, 50-60, 60-70, ..., 90-100, 100-150, 150-200, ...).
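
The binning can be sketched with `bisect` from the standard library. The edges below follow the bins listed above up to 200; what happens beyond that is elided in the text, so the sketch simply uses an open-ended last bin. The handling of boundary values (half-open intervals) and the function name `diameter_bin` are assumptions.

```python
from bisect import bisect_right

# Bin edges as listed in the text, up to 200. The text elides what follows,
# so larger diameters fall into an open-ended last bin here (assumption).
EDGES = [1, 10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 150, 200]

def diameter_bin(diameter):
    """Return a label such as '10-15' for a diameter in inches.

    Assumption: bins are half-open [low, high), so a diameter of exactly 10
    lands in '10-15'. The source does not say how boundaries were handled.
    """
    if diameter < EDGES[0]:
        return f"<{EDGES[0]}"
    i = bisect_right(EDGES, diameter)
    if i == len(EDGES):
        return f"{EDGES[-1]}+"
    return f"{EDGES[i - 1]}-{EDGES[i]}"

for d in (4, 10, 55, 120, 250):
    print(d, diameter_bin(d))
```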

### 3.1.3 SVM
As a third tool, Support Vector Machines (SVMs) were tried out. An SVM can do linear classification by finding a maximum-margin separating hyperplane between the classes. It can also do non-linear classification using the so-called kernel trick, where inputs are implicitly mapped to a high-dimensional feature space. SVMs were used to predict health based on GPS coordinates.
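
For reference, the standard textbook formulation of the linear, hard-margin SVM for two classes is given below. The symbols are not from the project itself: $x_i$ are the training points, $y_i \in \{-1, +1\}$ their labels, and $w$, $b$ define the hyperplane $w^\top x + b = 0$.

$$
\min_{w,\,b} \; \frac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad
y_i \left( w^\top x_i + b \right) \ge 1, \qquad i = 1, \dots, n
$$

The kernel trick replaces inner products $x^\top x'$ with a kernel function $K(x, x') = \langle \phi(x), \phi(x') \rangle$, where $\phi$ is the implicit feature map. A common choice is the RBF kernel, $K(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right)$ with $\gamma > 0$. Which kernel was used in this project is not stated here, and since health has three classes, a multi-class scheme on top of the two-class formulation would be needed.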

### 3.1.4 Apriori
Apriori is an algorithm for frequent itemset mining. It was used to inspect which problems appear together.
<font color='red'> DANIELE: ADD MORE?</font>
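
A compact sketch of the Apriori idea is shown below, using only the standard library. It is an illustration, not the code used in the project: the function name `apriori`, the toy transactions and the support threshold are made up, with each transaction standing for the set of problems recorded for one tree.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support.

    Level-wise search: a k-itemset is only counted if every one of its
    (k-1)-subsets was already found frequent (the Apriori pruning step).
    """
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}
    frequent = {}
    k = 1
    while current:
        frequent.update({s: support(s) for s in current})
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step, then count support of the surviving candidates.
        current = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
            and support(c) >= min_support
        }
    return frequent

# Toy transactions (made up): problems recorded for five trees.
transactions = [
    {"root_stone", "brch_light"},
    {"root_stone", "brch_light", "trunk_wire"},
    {"root_stone"},
    {"brch_light"},
    {"root_stone", "brch_light"},
]

result = apriori(transactions, min_support=0.5)
for itemset, sup in sorted(result.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), sup)
```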

### 3.1.5 Linear regression
Linear regression was used to inspect the correlation between different features; it is not so much a machine learning tool as a tool for investigating linear relationships. It was used to predict air pollution based on the number of trees as well as their diameter.
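
Simple linear regression with one predictor can be written out directly from the least-squares formulas. The sketch below uses made-up numbers that are exactly linear so the result is easy to check by hand; they say nothing about the real relationship between trees and air pollution. The function names are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(xs, ys, slope, intercept):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Made-up toy values: x could be trees per area, y a pollution measure.
xs = [1, 2, 3, 4]
ys = [10, 8, 6, 4]

slope, intercept = fit_line(xs, ys)
print(slope, intercept, r_squared(xs, ys, slope, intercept))
```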

## 3.2 Model selection
When selecting appropriate models, the first step is to split the data into a training set and a test set.
When predicting health (or species) based on GPS coordinates, a test set consisting of 15% of the observations was used. The remaining training data was then split into training and validation sets using 5-fold cross-validation. The best model was chosen based on accuracy score, with computation time taken into account as well. For KNN, different values of $k$ were tried, ranging over $k=2,...,10$. The limit was set to 10 because whole areas of unhealthy trees were not expected, and a larger $k$ might just confuse the predictions.
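
The selection procedure can be sketched as follows: hold out 15% as a test set, then run 5-fold cross-validation on the rest for each candidate $k$. This is a standard-library sketch under stated assumptions, not the project's code. The helper names are made up, `score` is a hypothetical callback standing in for "fit the model and return validation accuracy", and computation time is not taken into account.

```python
import random

def train_test_split(indices, test_fraction, seed=0):
    """Shuffle the indices and hold out a fraction of them as the test set."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def k_fold(indices, n_folds):
    """Yield (train, validation) index lists, one pair per fold."""
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

def select_k(train_idx, score, candidates=range(2, 11), n_folds=5):
    """Pick the candidate k with the best mean validation score.

    `score(k, train, validation)` is a hypothetical callback that fits a
    model with parameter k on `train` and returns accuracy on `validation`.
    """
    def mean_score(k):
        scores = [score(k, tr, va) for tr, va in k_fold(train_idx, n_folds)]
        return sum(scores) / len(scores)
    return max(candidates, key=mean_score)

train_idx, test_idx = train_test_split(range(100), test_fraction=0.15)

# Dummy scoring function for illustration only: pretends k = 7 is best.
def dummy_score(k, train, validation):
    return 1 - abs(k - 7) / 10

best_k = select_k(train_idx, dummy_score)
print(len(train_idx), len(test_idx), best_k)
```

The test set is only touched once, after `best_k` has been chosen, so that the reported test accuracy is not biased by the selection.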

### <font color='red'> DANIELE: MISSING INFO - WHAT DID YOU Do?</font>

## 3.3 Model performance
For predicting species and health based on GPS coordinates, KNN was selected as the best model. SVM took a significant amount of time to run, making it difficult to fine-tune and handle. Decision trees overfitted the training data and were not good at handling sparse data. When predicting health, the decision tree classifier took the easy choice and predicted "good" no matter what, which is of course the best uninformed guess, since most trees were in good condition. KNN handled the sparsity of the "fair" and "poor" trees better. When the KNN classifier labeled a tree as "fair", it was correct around 50% of the time. The same went for the "poor" classifications, whereas it was, as expected, much better at predicting the good trees. This makes sense since much more training data was available for that class.
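
The two quantities discussed above, the share of correct predictions per predicted label and the accuracy of always guessing the majority class, can be computed as in the sketch below. The labels are made up for illustration and are not the project's results; the function names are illustrative.

```python
from collections import Counter

def per_class_precision(y_true, y_pred):
    """For each predicted label, the share of those predictions that were right."""
    predicted = Counter(y_pred)
    correct = Counter(p for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / predicted[label] for label in predicted}

def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent label."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)

# Toy labels (made up): mostly "good" trees, as in an unbalanced dataset.
y_true = ["good"] * 8 + ["fair", "poor"]
y_pred = ["good"] * 7 + ["fair", "fair", "good"]

print(per_class_precision(y_true, y_pred))
print(majority_baseline_accuracy(y_true))
```

Comparing a classifier against the majority baseline is what reveals whether it has learned anything beyond "most trees are good". A label that is never predicted, such as "poor" in this toy example, does not appear in the precision dictionary at all.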
### <font color='red'>CECILIE: WRITE TEST RESULTS (RUN AGAIN FOR WHOLE NYC)</font>

### <font color='red'>DANIELE: WRITE SOMETHING ABOUT YOUR MODELS PERFORMANCE</font>

# 4. Visualizations
### <font color='red'> CECILIE: MISSING</font>

# 5. Discussion
### <font color='red'> CECILIE: MISSING</font>