# Week 1: Pandas Introduction, DataFrame from CSV

Let's start working with pandas, which makes analyzing this data easier. The convention is to import it as `pd`:

In [None]:
import pandas as pd

# numpy is a widely used Python library for numerical computing
import numpy as np

# two libraries for plotting
import matplotlib.pyplot as plt
import seaborn as sns

## 1. Reading/Writing files
### 1.1. Reading

In [None]:
# Pandas has built-in tools to read files, including csv and excel.
complaints = pd.read_csv('data/311-service-requests.csv')

Depending on your pandas version, you might see a warning like "DtypeWarning: Columns (8) have mixed types". This means pandas could not infer a single type for that column while reading our data. In this case it almost certainly means that some entries in the column are strings and others are missing. The function reads missing entries as NaN, which is a float.

For now we're going to ignore it and hope we don't run into a problem, but in the long run we'd need to investigate this warning.
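When the time comes to investigate, one option is to read the affected column as text and check how many entries are missing. The sketch below uses a tiny made-up CSV held in memory; the column names `id` and `zip` are placeholders, not columns taken from the 311 file:

```python
import io
import pandas as pd

# Hypothetical miniature CSV; the third row has an empty "zip" field
csv_text = "id,zip\n1,10001\n2,11201-1234\n3,\n4,10002\n"

# dtype={"zip": str} tells read_csv to keep the column as text
# instead of inferring a type
mini = pd.read_csv(io.StringIO(csv_text), dtype={"zip": str})

# Missing entries are still read as NaN, even in a text column
n_missing = mini["zip"].isna().sum()
print(mini)
print("missing zip values:", n_missing)
```

On the real file the same `dtype=` argument could be passed alongside the file path, once you know which column the warning refers to.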

In [None]:
complaints

In [None]:
# type of the results
type(complaints)

**The function returns a DataFrame, an object with many attributes and methods defined on it.**

### 1.2. Writing

*We can also save out a csv file from pandas in a simple way*


In [None]:
complaints.to_csv("data/saved_complaints.csv")

*What if we only wanted some of the columns? We can pick which ones to write out*


In [None]:
complaints.to_csv("data/saved_complaints2.csv", columns=["Created Date", "Closed Date"])

*If you don't want the extra index column with the row numbers, or the header row, you can leave them out by passing `index=False` and `header=False`*


In [None]:
complaints.to_csv("data/saved_complaints3.csv", columns=["Created Date", "Closed Date"], index=False, header=False)
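To see what these options do without touching the disk, here is a small sketch that writes to an in-memory buffer and reads the text back. The two-row table is made up and only borrows the column names used above:

```python
import io
import pandas as pd

# Hypothetical stand-in for the complaints table
small = pd.DataFrame({
    "Created Date": ["10/31/2013", "10/30/2013"],
    "Closed Date": ["11/01/2013", None],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
small.to_csv(buffer, index=False)  # index=False leaves out the row numbers

# Read the text back in to check that the table survives the round trip
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(buffer.getvalue())
```

The missing "Closed Date" is written as an empty field and comes back as NaN.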

## 2. Viewing Data
### 2.1. See the top & bottom rows of the frame 

In [None]:
complaints.head()  # head() returns the first 5 rows of the table by default

In [None]:
complaints.tail(3)

### 2.2. Plotting data
*For now, there is nothing to plot in the complaints dataset. Let's use another dataset to learn some plotting tools*


In [None]:
df = pd.read_csv('data/goog.csv')
df.head()

In [None]:
# To avoid executing the command plt.show() every time we want to show a graph
%matplotlib inline

In [None]:
# To make a plot from pandas, we use the dataframe object we created, then tell the plot function
# what to use as the X column and what to use as the Y column.
df.plot("Date", "High",figsize=(20,10))

In [None]:
# If we plot Volume, we can see how little volume data there is, all of it at the end:
df.plot("Date", "Volume",figsize=(20,10))

## 3. Selection
*Let's use complaints again*
### 3.1. Get a slice from a DataFrame
#### Get a column

In [None]:
# comment/uncomment lines
complaints["Agency"]
#complaints.Agency
#complaints.loc[:,'Agency'] #selection by label
#complaints.iloc[:,3]     #selection by position

#### Get a row

In [None]:
# comment/uncomment lines
complaints.iloc[1,:]
#complaints.iloc[1] #Selection by position
#complaints.loc[1,:] #Selection by Label

In [None]:
print(type(complaints["Agency"]))
print(type(complaints.iloc[1,:]))

**A single column or a single row selected from a DataFrame is not a DataFrame but a Series. The methods available on the two are not exactly the same.**

In [None]:
# A DataFrame can be built from a Series
pd.DataFrame(complaints["Agency"]).head()
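The difference between the two types is easy to check on a small example. This sketch uses a made-up three-row table (the values are placeholders) and shows that single brackets give a Series while a list of column names gives a DataFrame:

```python
import pandas as pd

# Hypothetical two-column table
small = pd.DataFrame({"Agency": ["NYPD", "DOT", "NYPD"],
                      "Borough": ["BRONX", "QUEENS", "BROOKLYN"]})

one_column = small["Agency"]        # single brackets: a Series
one_row = small.iloc[1]             # a single row: also a Series
still_a_frame = small[["Agency"]]   # a list with one column name: a DataFrame

print(type(one_column).__name__)
print(type(one_row).__name__)
print(type(still_a_frame).__name__, still_a_frame.shape)
```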

#### Select multiple rows and columns

Sometimes you want to select rows and columns at the same time.  You can do that with `.loc[]` and `.iloc[]`.

`iloc[rows,columns]` is for use with **number selectors** for row and column, and `loc[rows,columns]` is for **label selectors** for row and column (if they exist).

The documentation for this is [here](http://pandas.pydata.org/pandas-docs/stable/indexing.html).

If you use these, you must put a row selector first, then a column selector:

In [None]:
#Selecting by integer is done with iloc:
# this selects the first 10 rows and the first 3 columns.
complaints.iloc[0:10, 0:3]

In [None]:
# select rows 0:5, but the columns between the named columns here:
# Note, this works with row numbers because there is no label for the rows aside from numbers.
#  Notice this command loc is "inclusive" of the end points on the range.  Meaning it includes them.
complaints.loc[0:5, 'Closed Date':'Complaint Type']
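The inclusive/exclusive difference is easier to see when the row labels are not simply 0, 1, 2 and so on. A minimal sketch, assuming a made-up table with labels starting at 100:

```python
import pandas as pd

# Hypothetical table whose row labels are NOT 0, 1, 2, ...
small = pd.DataFrame(
    {"a": [10, 20, 30, 40], "b": ["w", "x", "y", "z"]},
    index=[100, 101, 102, 103],
)

by_position = small.iloc[0:2]   # positions 0 and 1; the end point is excluded
by_label = small.loc[100:102]   # labels 100, 101 and 102; the end point is included

print(len(by_position), len(by_label))
```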

What if we just want to know the complaint type and the borough, but not the rest of the information? Pandas makes it really easy to select a subset of the columns: just index with a list of the columns you want, using double square brackets (or using the `loc[]` method above):

In [None]:
# Comment/uncomment

complaints[['Complaint Type', 'Borough']]
#complaints.loc[:,['Complaint Type', 'Borough']]

### 3.2. Boolean Indexing

In [None]:
complaints[complaints['Unique Key'] > 26595140].head()

**The `isin()` method**

In [None]:
complaints[complaints['Complaint Type'].isin(['Noise - Vehicle','Noise - Street/Sidewalk'])]

**Using a list of Booleans**

In [None]:
complaints['Complaint Type'] == 'Noise - Street/Sidewalk'

This is a big Series of `True`s and `False`s, one for each row in our dataframe. When we index our dataframe with it, we get just the rows where the value is `True`.  Note that for row filtering to work, the boolean Series must have the same length as the dataframe's index.

In [None]:
complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk' ]

You can also combine more than one condition with the `&` operator like this:

In [None]:
is_noise = complaints['Complaint Type'] == "Noise - Street/Sidewalk"
in_brooklyn = complaints['Borough'] == "BROOKLYN"
# now, both must be true since we use & here:
complaints[is_noise & in_brooklyn][:10]

Or, to limit the columns we return -- we can specify them:

In [None]:
complaints[is_noise & in_brooklyn][['Complaint Type', 'Borough', 'Created Date', 'Descriptor']][:10]
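Besides `&`, masks can be combined with `|` (or) and negated with `~` (not). The sketch below tries each one on a made-up four-row table, and also shows `.loc` taking the row mask and the column list in a single step:

```python
import pandas as pd

# Hypothetical miniature version of the complaints table
small = pd.DataFrame({
    "Complaint Type": ["Noise - Street/Sidewalk", "Illegal Parking",
                       "Noise - Street/Sidewalk", "Noise - Vehicle"],
    "Borough": ["BROOKLYN", "BROOKLYN", "QUEENS", "BROOKLYN"],
})

is_noise = small["Complaint Type"] == "Noise - Street/Sidewalk"
in_brooklyn = small["Borough"] == "BROOKLYN"

both = small[is_noise & in_brooklyn]     # AND: both conditions hold
either = small[is_noise | in_brooklyn]   # OR: at least one condition holds
not_noise = small[~is_noise]             # NOT: the condition does not hold

# Conditions written inline need parentheses, because & binds tighter than ==
inline = small[(small["Borough"] == "QUEENS")
               & (small["Complaint Type"] == "Noise - Street/Sidewalk")]

# .loc filters rows and picks columns in one step
subset = small.loc[is_noise & in_brooklyn, ["Complaint Type"]]
print(len(both), len(either), len(not_noise), len(inline), subset.shape)
```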

## 4. Get information about data
### 4.1. Display the index, columns, and the underlying numpy data

In [None]:
complaints.index

In [None]:
complaints.columns

In [None]:
complaints.values

In [None]:
complaints.dtypes

### 4.2. Describe data

In [None]:
complaints.describe()
# Only the numeric columns

*Now we can see the min and max, and check if we were right!*

In [None]:
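# We define a helper, get_highest, that returns the largest value in a column
def get_highest(data, name_column):
    # start below any real value so that all-negative columns (e.g. Longitude) work
    highest = float('-inf')
    for value in data.loc[:,name_column]:
        # a comparison with NaN is always False, so missing values are skipped
        if float(value) > highest:
            highest = float(value)
    return highest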

In [None]:
get_highest(complaints, 'Latitude')
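The loop above is useful for seeing what happens under the hood, but pandas already has a `.max()` method that does the same job. A small sketch on made-up coordinates (the numbers are placeholders, not values from the 311 file):

```python
import numpy as np
import pandas as pd

# Hypothetical coordinates with some missing values
small = pd.DataFrame({"Latitude": [40.70, np.nan, 40.85],
                      "Longitude": [-73.95, -73.90, np.nan]})

highest_lat = small["Latitude"].max()    # NaN is skipped by default
highest_lon = small["Longitude"].max()   # works for all-negative columns too

# describe() reports the same value in its "max" row
described_max = small.describe().loc["max", "Latitude"]
print(highest_lat, highest_lon, described_max)
```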

## 5. Reorganizing data
### 5.1. Transposing data 

In [None]:
complaints.T

### 5.2. Sorting data

In [None]:
complaints.sort_values(by = 'Unique Key').head()
# add arg inplace = True to perform operation in-place
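`sort_values` can also sort in descending order or by several columns at once. A minimal sketch on a made-up table, which also shows that the original is left untouched when `inplace` is not set:

```python
import pandas as pd

# Hypothetical four-row table
small = pd.DataFrame({"Borough": ["QUEENS", "BRONX", "QUEENS", "BRONX"],
                      "Unique Key": [3, 4, 1, 2]})

# Largest key first
largest_first = small.sort_values(by="Unique Key", ascending=False)

# Sort by borough, then by key within each borough
by_two = small.sort_values(by=["Borough", "Unique Key"])

print(largest_first)
print(by_two)
```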

## 6. Transforming data
### 6.1. Transforming existing data
*To avoid transforming complaints, we first make a copy*

In [None]:
complaints_copy = complaints.copy()
# complaints_copy = complaints only creates an alias, not a new object!
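The difference between an alias and a copy can be checked directly. In this sketch the names `original`, `alias` and `independent` are made up for the illustration:

```python
import pandas as pd

original = pd.DataFrame({"x": [1, 2, 3]})

alias = original               # a second name for the same object
independent = original.copy()  # a new object with its own data

alias.loc[0, "x"] = 99         # also changes original
independent.loc[1, "x"] = -1   # does not change original

print(alias is original, independent is original)
print(original)
```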

*You just have to assign new values to a selection*

In [None]:
complaints_copy.loc[0:2,'Unique Key'] = np.array([0,0,0])
complaints_copy.head()

### 6.2. Adding new data

In [None]:
# Adding a column
complaints_copy['New_column'] = np.array(['new_info']*len(complaints_copy))
complaints_copy.head()

### 6.3. Resetting Index

In [None]:
complaints_copy = complaints.sort_values(by = 'Unique Key')
complaints_copy.head()

In [None]:
complaints_copy.reset_index(inplace=True, drop = True)
complaints_copy.head()
# drop = True means not to keep the old index
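To see what `drop` changes, here is a short sketch on a made-up three-row table. After sorting, the old row labels travel with their rows, and `reset_index` either discards them or keeps them in a new column:

```python
import pandas as pd

# Hypothetical table with keys out of order
small = pd.DataFrame({"Unique Key": [30, 10, 20]})

sorted_small = small.sort_values(by="Unique Key")
old_labels = list(sorted_small.index)   # labels follow their rows after sorting

dropped = sorted_small.reset_index(drop=True)  # old labels are discarded
kept = sorted_small.reset_index()              # old labels are saved in an "index" column

print(old_labels)
print(dropped)
print(kept)
```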

### 6.4. Dealing with NA

In [None]:
# To get the boolean mask where values are NaN 
complaints.isnull().head()

In [None]:
# Filling missing data
complaints_copy = complaints.copy()
complaints_copy.fillna(value = 0, inplace=False).head()

In [None]:
# fillna also works on a Series; assign the result back to the column
complaints_copy['Bridge Highway Name'] = complaints_copy['Bridge Highway Name'].fillna(value='Unknown')
complaints_copy.head()

In [None]:
# Dropping NaN
complaints.dropna(axis = 1, how = 'any', inplace = False).head()
# axis: where missing values are dropped, 0 <=> drop rows and 1 <=> drop columns
# how: 'any' drops a row/column if it contains at least one NaN
#      'all' drops a row/column only if every value in it is NaN
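The effect of `axis` and `how` is easiest to see on a small table. This sketch uses three made-up columns: one complete, one with a single gap, and one that is entirely missing. It also shows `fillna` with a dict, which gives a different fill value for each column:

```python
import numpy as np
import pandas as pd

# Hypothetical table: one full column, one with a gap, one entirely empty
small = pd.DataFrame({
    "full": [1, 2, 3],
    "partial": [1.0, np.nan, 3.0],
    "empty": [np.nan, np.nan, np.nan],
})

cols_any = small.dropna(axis=1, how="any")   # keeps only "full"
cols_all = small.dropna(axis=1, how="all")   # drops only "empty"
rows_any = small.dropna(axis=0, how="any")   # every row has a NaN in "empty", so none survive

# A dict gives a different fill value per column
filled = small.fillna({"partial": 0.0, "empty": -1.0})

print(list(cols_any.columns), list(cols_all.columns), len(rows_any))
print(filled)
```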

## 7. Counting Values And Filtering<a name="_counting values and filtering"></a>

What's the most common complaint type? This is a really easy question to answer! There's a `.value_counts()` method that we can use:

In [None]:
complaints['Complaint Type'].value_counts()

If we just wanted the top 10 most common complaints, we can do this:

In [None]:
complaint_counts = complaints['Complaint Type'].value_counts()
complaint_counts[:10]

But it gets better! We can plot them!

In [None]:
complaint_counts[:10].plot(kind='bar')

We can also see how many unique values a column contains, here the agencies in `Agency`, using `.unique()` and `len` (for length):

In [None]:
complaints['Agency'].unique()

In [None]:
# get the number of unique values for a feature
len(complaints['Agency'].unique())
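pandas also has a `.nunique()` method that returns the count directly. One difference worth knowing: `unique()` includes the missing value in its result, while `nunique()` ignores it by default. A sketch on a made-up Series:

```python
import pandas as pd

# Hypothetical agency column with one missing entry
agency = pd.Series(["NYPD", "DOT", "NYPD", None, "HPD"])

n_with_len = len(agency.unique())               # the missing value counts as one entry
n_unique = agency.nunique()                     # missing values are ignored by default
n_with_missing = agency.nunique(dropna=False)   # count the missing value too

print(n_with_len, n_unique, n_with_missing)
```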

## 8. Application: Which Borough has the Most Noise Complaints?<a name="_ 8. application : which borough has the most noise complaints?"></a>

To get the noise complaints, we first need to find the rows where the "Complaint Type" column is "Noise - Street/Sidewalk". We use boolean indexing (see section 3.2):

In [None]:
noise_complaints = complaints[  complaints['Complaint Type'] == "Noise - Street/Sidewalk"  ] 
noise_complaints[0:10]

In [None]:
noise_complaints['Borough'].value_counts()

It's Manhattan! But what if we wanted to divide by the total number of complaints in each borough, so that boroughs with more complaints overall don't dominate the result? That would be easy too:

In [None]:
complaints['Borough'].value_counts()

In [None]:
results = noise_complaints['Borough'].value_counts() / complaints['Borough'].value_counts()
results
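The division works because pandas matches the two Series by their labels (the borough names), not by position. The sketch below uses a made-up five-row table to show this, including what happens to a borough that has no noise complaints at all:

```python
import pandas as pd

# Hypothetical miniature version of the complaints table
small = pd.DataFrame({
    "Complaint Type": ["Noise - Street/Sidewalk", "Illegal Parking",
                       "Noise - Street/Sidewalk", "Illegal Parking",
                       "Illegal Parking"],
    "Borough": ["MANHATTAN", "MANHATTAN", "BROOKLYN", "BROOKLYN", "QUEENS"],
})

noise = small[small["Complaint Type"] == "Noise - Street/Sidewalk"]
noise_counts = noise["Borough"].value_counts()
all_counts = small["Borough"].value_counts()

# Division matches boroughs by label; QUEENS has no noise rows, so it becomes NaN
share = noise_counts / all_counts

# .div with fill_value=0 treats the missing count as 0 instead
share_filled = noise_counts.div(all_counts, fill_value=0)

print(share)
print(share_filled)
```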

What if we want to graph this?
The default is a line graph, which is not the right type for this kind of data.  This is not time series data (where the X axis is dates/times).  It is a ratio per category, and the categories, the boroughs, are not ordered.  The proper type of chart for this is a bar graph.

In [None]:
results.plot()

In [None]:
# we can use the keyword argument kind for this:
results.plot(kind = 'bar')