In [None]:
# The usual preamble
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
plt.matplotlib.rcParams['savefig.dpi'] = 144
import seaborn

# Chapter 2: Selecting Data & Finding Complaint Types

## Section 1

We're going to use a new dataset here, to demonstrate how to deal with larger datasets. This is a subset of the 311 service requests from 2001 from [NYC Open Data](https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9).

Today we are also going to read CSVs from **the internet** so that you don't have to upload files.

Any file that offers a "copy link" option for its download can be read straight into a dataframe: pass that link to pandas' `read_csv`.

In [None]:
url = "https://data.cityofnewyork.us/resource/fhrw-4uyv.csv"
complaints = pd.read_csv(url)
complaints.head(5)

In [None]:
complaints = pd.read_csv('../data/311-service-requests.csv')

# 2.1 What's even in it? (the summary)

When you have a large dataframe, instead of looking through the whole thing, you can ask pandas to tell you about it.

In [None]:
complaints.dtypes
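
Beyond `.dtypes`, a few other built-in summaries can help. Here is a minimal sketch on a small made-up dataframe. The column names and values are illustrative assumptions, not taken from the 311 data.

```python
import pandas as pd

# Hypothetical stand-in for the complaints dataframe
toy = pd.DataFrame({
    "Complaint Type": ["Noise", "Heating", "Noise"],
    "Borough": ["BROOKLYN", "BRONX", "QUEENS"],
    "Incident Zip": [11201, 10451, 11101],
})

n_rows, n_cols = toy.shape        # (number of rows, number of columns)
column_names = list(toy.columns)  # column labels, in order
toy.info()                        # prints dtypes, non-null counts and memory use
```

The same three calls should work on `complaints` once it is loaded.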

## Changing the size of the dataset

You can also use **.drop()** to reduce the size of the dataset by removing columns:

`df = df.drop(columns=["boat", "body", "home.dest"])`

You can also use the **del** command to delete a column in place:

`del df['cabin']`
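
To see both approaches side by side, here is a small sketch. The dataframe is hypothetical and only borrows the column names from the snippets above. It uses the `columns=` keyword because recent pandas versions no longer accept the axis as a positional argument.

```python
import pandas as pd

# Hypothetical dataframe reusing the column names mentioned above
df = pd.DataFrame({
    "name": ["A", "B"],
    "boat": ["2", None],
    "body": [None, 135.0],
    "home.dest": ["NY", "London"],
    "cabin": ["B5", None],
})

# drop() returns a new dataframe, so reassign to keep the result
df = df.drop(columns=["boat", "body", "home.dest"])

# del removes a single column in place
del df["cabin"]

remaining = list(df.columns)
```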

## Cleaning a Dataset

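You can also use **.dropna()** to drop rows that contain missing values:

`.dropna()`

To drop only the rows that are missing a value in particular columns, pass `subset`:

`.dropna(how="any", subset=['embarked'])`

You can also count the null values in each column:

`.isnull().sum()`

and fill them in with a value of your choice, for example `0`:

`.fillna(0)`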
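
The cleaning calls above can be tried out on a tiny example. This is a sketch on made-up data. The `age` column and the choice to fill with the column mean are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing values
df = pd.DataFrame({
    "embarked": ["S", None, "C", "S"],
    "age": [22.0, 38.0, np.nan, 35.0],
})

missing_per_column = df.isnull().sum()                # nulls in each column
no_missing_embarked = df.dropna(subset=["embarked"])  # drop rows lacking 'embarked'
no_missing_at_all = df.dropna()                       # drop rows with any null
age_filled = df["age"].fillna(df["age"].mean())       # fill nulls with the column mean
```

None of these calls modify `df` itself. Each returns a new object, so assign the result if you want to keep it.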







## Describing a Dataset 

Use **.describe()** to get summary statistics for the dataframe.

Why are only some of the columns showing up?
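
One way to explore that question: when a dataframe mixes numeric and text columns, `.describe()` summarizes only the numeric ones by default. The sketch below uses made-up data with illustrative column names to compare the default with `include="all"`.

```python
import pandas as pd

# Hypothetical dataframe with one text column and one numeric column
toy = pd.DataFrame({
    "Complaint Type": ["Noise", "Heating", "Noise"],
    "Incident Zip": [11201, 10451, 11101],
})

numeric_summary = toy.describe()            # numeric columns only, by default
full_summary = toy.describe(include="all")  # also summarizes the text column
```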

# 2.2 Selecting columns and rows

To select a column, we index with the name of the column, like this:

In [None]:
complaints['Complaint Type']

To get the first 5 rows of a dataframe, we can use a slice: `df[:5]`.

This is a great way to get a sense for what kind of information is in the dataframe -- take a minute to look at the contents and get a feel for this dataset.

In [None]:
complaints[:5]

We can combine these to get the first 5 rows of a column:

In [None]:
complaints['Complaint Type'][:5]

and it doesn't matter which direction we do it in:

In [None]:
complaints[:5]['Complaint Type']
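
The two chained forms above can also be written as a single `.loc` call. Here is a minimal sketch on made-up data (the values are illustrative). Note that `.loc` slices by label and includes the end label, so `:4` returns five rows when the index is the default 0, 1, 2, ...

```python
import pandas as pd

# Hypothetical stand-in for the complaints dataframe
toy = pd.DataFrame({
    "Complaint Type": ["Noise", "Heating", "Noise", "Rodent", "Noise", "Water", "Heating"],
    "Borough": ["BROOKLYN", "BRONX", "QUEENS", "BRONX", "MANHATTAN", "QUEENS", "BROOKLYN"],
})

chained = toy["Complaint Type"][:5]           # column first, then rows
single_call = toy.loc[:4, "Complaint Type"]   # labels 0 through 4 inclusive, one column
by_position = toy.iloc[:5]["Complaint Type"]  # position-based, end-exclusive
```

All three return the same five values. The single `.loc` call is usually the safer choice if you later want to assign to the selection.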

# 2.3 Selecting multiple columns

What if we just want to know the complaint type and the borough, but not the rest of the information? Pandas makes it really easy to select a subset of the columns: just index with a list of the columns you want.

In [None]:
complaints[['Complaint Type', 'Borough']]

That displays a truncated summary of the result. To look at just the first 10 rows:

In [None]:
complaints[['Complaint Type', 'Borough']][:10]

# 2.4 What's the most common complaint type?

This is a really easy question to answer! There's a `.value_counts()` method that we can use:

In [None]:
complaints['Complaint Type'].value_counts()

If we just wanted the top 10 most common complaints, we can do this:

In [None]:
complaint_counts = complaints['Complaint Type'].value_counts()
complaint_counts[:10]
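
To see what `.value_counts()` returns, here is a small sketch on a made-up series (the values are illustrative). It also shows `.head()` as an alternative to slicing, and `normalize=True`, which reports each value's share instead of its raw count.

```python
import pandas as pd

# Hypothetical complaint types
complaint_types = pd.Series(
    ["Noise", "Heating", "Noise", "Rodent", "Noise", "Heating"]
)

counts = complaint_types.value_counts()                # most frequent first
top_two = counts.head(2)                               # same idea as counts[:2]
shares = complaint_types.value_counts(normalize=True)  # fractions that sum to 1
```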

But it gets better! We can plot them!

In [None]:
complaint_counts[:10].plot(kind='bar')


## Section 2

Now that you've worked through that with the **NYC data**, let's look at **DC 311 data from 2016**.

The link you want to use is: "https://opendata.arcgis.com/datasets/0e4b7d3a83b94a178b3d1f015db901ee_7.csv"

You can find the data dictionary [here](http://opendata.dc.gov/datasets/city-service-requests-in-2016?geometry=-78.075%2C38.708%2C-76.222%2C39.082).

The following cell is your workspace. Feel free to add more cells as needed while you repeat the analysis.