# Dogs of NYC
## Tiber Training November 17th, 2017

WNYC provides a great resource on the [names of dogs in New York City](https://project.wnyc.org/dogs-of-nyc/), which seems like a great place to start our analysis!

**[Pandas](http://pandas.pydata.org/)** is the package we're going to be using the most!

##### Pandas? Pandas!

![Image of red panda](https://i.pinimg.com/736x/57/49/cb/5749cb63e52dd8ce3a0376ddd185cdaf--adorable-pets-baby-animals-adorable.jpg)

>"What problem does pandas solve?

>Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like R.

>Combined with the excellent IPython toolkit and other libraries, the environment for doing data analysis in Python excels in performance, productivity, and the ability to collaborate.

>pandas does not implement significant modeling functionality outside of linear and panel regression; for this, look to statsmodels and scikit-learn. More work is still needed to make Python a first class statistical modeling environment, but we are well on our way toward that goal."


# Pandas Library Highlights

![Image of panda library](https://farm8.staticflickr.com/7397/12824365945_bca225debe_z.jpg)


> - A fast and efficient DataFrame object for data manipulation with integrated indexing;
> - Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
> - Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
> - Flexible reshaping and pivoting of data sets;
> - Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
> - Columns can be inserted and deleted from data structures for size mutability;
> - Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
> - High performance merging and joining of data sets;
> - Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
> - Time series-functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;

### Okay, let's start importing things:

matplotlib is a great plotting library

`plt.style.use('ggplot')` adds a customised style to my plots

`%matplotlib inline` makes sure my plots show up in the notebook! (Note there is no space after the `%`.)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# import seaborn as sns
plt.style.use('ggplot')
%matplotlib inline

# Again: Hadley Wickham's approach to EDA:  
![image of the data flow showing visualization as an exploratory and iterative process](http://benbestphd.com/images/r4ds_data-science.png)





#### The goal of EDA is to discover patterns in data. This is a fundamental stepping stone towards predictive modelling, or an end goal in itself. 

Tips for good EDA:
- Understand the context.
- Use graphical representations of data.
- Develop models in an iterative process of tentative model specification and residual assessment.
- Question the data: Who collected it? Who is distributing it? Do all of the patterns make sense given what you know about the world? If they don’t, go back and look more closely at your data.
- Don’t think of EDA as only an initial step.



You can use `pd.read_csv` to read files stored locally or on the web.
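
As a minimal sketch of the same call: `pd.read_csv` also accepts a file-like object, so a tiny in-memory CSV can stand in for a local file or a URL. The rows below are made up for illustration and are not taken from the WNYC file.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for a local file or a URL
csv_text = """dog_name,breed,borough
Max,Beagle,Brooklyn
Bella,Pug,Queens
Rocky,Boxer,Bronx
"""

# read_csv takes a path, a URL, or a file-like object such as StringIO
toy = pd.read_csv(io.StringIO(csv_text))
print(toy)
```

Swapping `io.StringIO(csv_text)` for a path or URL string is all it takes to read a real file.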

In [None]:
dogurl = 'https://raw.githubusercontent.com/aapeebles/tiber-Nov172017/master/Dogs%20of%20NYC%20-%20WNYC.csv'

df = pd.read_csv(dogurl)

## Looking at the data

`df.head()` shows the top of the data (the first 5 rows by default), and you can ask for the top 20, 10, or 32 by putting a number in the `()`.
You can see the bottom with `df.tail()`.

In [None]:
df.head()

Now look at the last 12 records in the dataset.
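
Here is a quick sketch of `head` and `tail` with explicit row counts. It uses a made-up 20-row frame (the `dog_id` column is hypothetical) so the result is easy to check by eye.

```python
import pandas as pd

# Hypothetical toy data: 20 made-up rows, not the WNYC dataset
toy = pd.DataFrame({"dog_id": range(1, 21)})

first_three = toy.head(3)  # first 3 rows
last_five = toy.tail(5)    # last 5 rows
print(first_three)
print(last_five)
```

The same pattern works on `df`; only the number in the parentheses changes.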

### What if we want to look at the columns without opening the file?

`df.columns`

### How can you see what type of variable each column is?

`df.dtypes`

### Check how many total rows and columns we have

`df.shape`

### And how many missing values each column has

`df.isnull().sum()`
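
The four checks above can be tried together on a small made-up frame. This is only a sketch: the values and the `age` column are hypothetical and chosen so that two columns have one missing value each.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a couple of deliberately missing values
toy = pd.DataFrame({
    "dog_name": ["max", "bella", None],
    "breed": ["Beagle", "Pug", "Boxer"],
    "age": [3.0, np.nan, 7.0],
})

print(toy.columns)         # column labels
print(toy.dtypes)          # an attribute, so no parentheses
print(toy.shape)           # (rows, columns)
print(toy.isnull().sum())  # number of missing values in each column
```

Note that `isnull().sum()` counts missing values column by column, not the number of rows that contain a gap.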

#### get rid of some duplicates by making everything lower case

using `str.lower()`
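
To see why lower-casing removes duplicates, here is a small sketch on made-up names: the same name typed three different ways counts as three unique values until the case is normalised.

```python
import pandas as pd

# Hypothetical names: "max" appears in three different spellings
toy = pd.DataFrame({"dog_name": ["Max", "MAX", "max", "Bella"]})

before = toy.dog_name.nunique()  # every spelling counts separately
toy["dog_name"] = toy.dog_name.str.lower()
after = toy.dog_name.nunique()   # spellings of the same name now match
print(before, after)
```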

In [None]:
df.dog_name = df.dog_name.str.lower()

### check what names are most common

`value_counts` creates a Series of how many times each unique value appears, sorted from most to least common.
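
A minimal sketch of what that looks like, using a handful of made-up names rather than the real dataset:

```python
import pandas as pd

# Hypothetical names, already lower-cased
names = pd.Series(["max", "bella", "max", "rocky", "max", "bella"])

counts = names.value_counts()  # index = name, value = number of appearances
top_two = counts.head(2)       # keep only the two most common names
print(top_two)
```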

In [None]:

df.dog_name.value_counts()[:10]


### How many dogs are there in each borough?

### How many Unique names are there?

In [None]:
df.dog_name.nunique()

### How many Unique dog BREEDS are there?

In [None]:
#plot the top five breeds of dogs
df.breed.value_counts()[:5].plot(kind='bar')

### Crosstab

Pandas has a great function called [crosstab](https://chrisalbon.com/python/pandas_crosstabs.html), which counts how often each combination of values from two columns occurs.
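
Here is a small sketch of `pd.crosstab` on made-up data, assuming `breed` and `neutered` columns like the ones used in this notebook. Each row of the result is a breed, each column is a `neutered` value, and each cell is a count.

```python
import pandas as pd

# Hypothetical toy data: five made-up dogs
toy = pd.DataFrame({
    "breed": ["Beagle", "Beagle", "Pug", "Pug", "Pug"],
    "neutered": ["Yes", "No", "Yes", "Yes", "No"],
})

# Rows come from the first argument, columns from the second
table = pd.crosstab(toy.breed, toy.neutered)
print(table)
```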

In [None]:
#quick comparison of what breeds are neutered vs. not
pd.crosstab(df.breed,df.neutered).sort_values('Yes', ascending=False)

## How many Trained dogs are there per borough?

To figure this out, let's first convert the 'guard_or_trained' column to numbers so it's easier to sum up.

One of the ways we can do that is to create a dictionary where we assign a number to each string:

`boolean = {'No': 0, 'Yes': 1}`

We can then map those values into a new 'trained' column, so each string value is replaced with the number we coded:

`df['trained'] = df.guard_or_trained.map(boolean)`


run the code and then check if it worked!
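
One way to check is sketched below on made-up data. It assumes the column only holds 'Yes', 'No', or missing values, which may not be true of the real file. Anything that is not a key in the dictionary comes out as `NaN`, so counting values with `dropna=False` shows whether something was left unmapped.

```python
import pandas as pd

# Hypothetical toy data, including one missing value
toy = pd.DataFrame({"guard_or_trained": ["Yes", "No", "No", "Yes", None]})

boolean = {"No": 0, "Yes": 1}
toy["trained"] = toy.guard_or_trained.map(boolean)

print(toy)
# dropna=False keeps NaN in the tally so unmapped values are visible
print(toy.trained.value_counts(dropna=False))
```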

## Groupby

[Groupby](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) is a powerful aggregating tool in pandas.

In the code below you have the following parts:

`df.groupby('borough').trained.sum().sort_values(ascending=False)`

- **df** is the dataframe you are referencing
- **groupby** is the pandas method you are calling
- **borough** is the variable WITHIN df you are grouping by
- **trained** is what you are actually counting
- **sum** is how you are aggregating; it is a pandas method that adds up the 1s in `trained` within each borough, so no direct numpy call is needed
- **sort_values** tells python how you want the aggregation sorted

Okay, run the code and see what you get!
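
If you want to see the mechanics first, here is the same chain on a tiny made-up frame where the totals are easy to work out by hand. The rows are hypothetical, not the WNYC data.

```python
import pandas as pd

# Hypothetical toy data: six made-up dogs, trained coded as 0/1
toy = pd.DataFrame({
    "borough": ["Brooklyn", "Queens", "Brooklyn", "Bronx", "Queens", "Brooklyn"],
    "trained": [1, 0, 1, 0, 1, 0],
})

# Split by borough, add up the 1s in each group, largest total first
per_borough = toy.groupby("borough").trained.sum().sort_values(ascending=False)
print(per_borough)
```

Because `trained` is coded as 0 and 1, the sum within each group is the number of trained dogs in that borough.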


### Repeat with Neutered

Do the same steps again, but this time count how many dogs are neutered per borough.

Stuck on what to do next? Here's a really thorough Pandas tutorial on the different ways in which you can approach the dataset: https://pandas.pydata.org/pandas-docs/stable/tutorials.html

### Next steps:
- There are a bunch of other datasets at http://opendata.dc.gov/ that you can also look at. I've included some datasets on awards given out by ward in this repo
- Want to explore more numerical datasets? Here's a great tutorial that looks at life expectancies around the world: https://github.com/alfredessa/awesomedata.science/blob/master/2.0PandasIntro/Intro.Pandas.EDA.ipynb
- Want to dive into the visualization aspect of EDA? This isn't a tutorial as much as a walk-through of Hadley Wickham's thought process. Still worth it: https://www.youtube.com/watch?v=ZdPNBF6GWBw
- Want more datasets to dig into? Jeremy Singer-Vine from BuzzFeed puts out a regular newsletter called Data is Plural. You can sign up and access the archives here: https://tinyletter.com/data-is-plural; Kaggle (the data science competition website) also has a tonne of great (clean!) datasets available: https://www.kaggle.com/datasets. Their head of data preparation, Rachael Tatman, also has a fantastic newsletter where she shares linguistic datasets: http://rachaeltatman.com/ (she also live codes on Friday nights!)