## [**``Pandas``**](https://pandas.pydata.org/) -- **Pan**el **da**ta **s**ets -- framework for multi-dimensional data sets
_____________________________

* A powerful open-source data analysis and manipulation library for Python, providing fast, flexible, and expressive **data structures** designed to make working with "relational" or "labeled" data both easy and intuitive

* The fundamental high-level building block for doing practical, **real-world** data analysis in Python


0. [A guide to many pandas tutorials](http://pandas.pydata.org/pandas-docs/version/0.15/tutorials.html)

1. [Yufeng: Wrangling Data with Pandas (AI Adventures)](https://www.youtube.com/watch?v=XDAnFZqJDvI)

2. [Python Pandas Tutorial 1. What is Pandas python? Introduction](https://www.youtube.com/watch?v=CmorAWRsCAw) from
[codebasics](https://github.com/codebasics/py/tree/master/pandas)

3. [Kevin Markham: Best practices with pandas](https://www.youtube.com/playlist?list=PL5-da3qGB5IBITZj_dYSFqnd_15JgqwA6)

4. [Jake VanderPlas: Python Data Science Handbook, chapt.3](https://github.com/jakevdp/PythonDataScienceHandbook)

5. [Corey Schafer: **Python Pandas Tutorial** (Part 1): Getting Started with Data Analysis - Installation and Loading Data](https://www.youtube.com/watch?v=ZyhVh-qRZPA&list=PL-osiE80TeTsWmV9i9c58mdDCSskIFdDS&ab_channel=CoreySchafer)

6. [pandas.kaggle](https://www.kaggle.com/learn/pandas)

In [None]:
import pandas as pd

* ``DataFrame`` is the central data structure in pandas
* ``DataFrame`` holds the type of data you might think of as a **table**. This is similar to a sheet in Excel, or a table in a SQL database.
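
As a minimal sketch of the table analogy, a ``DataFrame`` can be built from a dictionary of columns. The variable name `toy`, the column names, and the values are made up for illustration and do not come from `nyc_weather.csv`.

```python
import pandas as pd

# Hypothetical data: each dictionary key becomes a column,
# and each list position becomes a row
toy = pd.DataFrame({
    "city": ["New York", "Boston", "Chicago"],
    "temperature": [38, 35, 30],
})

print(toy.shape)                  # (number of rows, number of columns)
print(toy.columns.tolist())       # column labels, like a header row in a sheet
print(toy["temperature"].max())   # one column behaves like a labeled array
```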


In [None]:
df = pd.read_csv('nyc_weather.csv')

In [None]:
dir(df)


In [None]:
help(df.to_dict)

In [None]:
# what does each row represent? 
df

In [None]:
df.shape

In [None]:
import matplotlib.pyplot as plt
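# Interactive figures; this backend needs the ipympl package installed.
# Switch to `%matplotlib inline` below for static plots.
%matplotlib widget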
#%matplotlib inline
#df['Temperature'].plot(kind = 'bar');
df['Temperature'].plot();
df['Humidity'].plot();

In [None]:
# Dates (EST column) of the rows where the event was rain
df.loc[df['Events'] == 'Rain', 'EST']

In [None]:
# Replace all missing values with 0 in place (this affects every column, not only WindSpeedMPH)
df.fillna(0, inplace=True)
df['WindSpeedMPH'].mean()
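
Filling missing values with 0 before averaging changes the result, because each 0 is then counted as a real reading. The sketch below uses a made-up series (`wind` is a hypothetical name, not a column from the weather file) to compare the two means.

```python
import numpy as np
import pandas as pd

# Hypothetical wind speeds with one missing reading
wind = pd.Series([6.0, np.nan, 12.0])

mean_skipping_nan = wind.mean()           # NaN is skipped by default
mean_after_fill = wind.fillna(0).mean()   # the filled 0 counts as a reading

print(mean_skipping_nan)
print(mean_after_fill)
```

Whether 0 is a sensible fill value depends on what a missing entry means in the data set at hand.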

In [None]:
df

In [None]:
df.describe()

In [None]:
df.PrecipitationIn.sum()

In [None]:
# Maximum value in the data set
maxValue = df['Temperature'].max()
print(maxValue)

In [None]:
print("The warmest days")
df[df['Temperature'] == maxValue]
#df[df['Temperature'] == df['Temperature'].max()]

#### Main features
__________________________

  - Easy handling of missing data in floating point as well as non-floating
    point data
  - Size mutability: columns can be inserted and deleted from DataFrame 
  - Automatic and explicit data alignment: objects can be explicitly aligned
    to a set of labels, or the user can simply ignore the labels and let
    `Series`, `DataFrame`, etc. automatically align the data for computations
  - Powerful, flexible group by functionality to perform split-apply-combine
    operations on data sets, for both aggregating and transforming data
  - Make it easy to convert ragged, differently-indexed data in other Python
    and NumPy data structures into DataFrame objects
  - Intelligent label-based slicing, fancy indexing, and subsetting of large
    data sets
  - Intuitive merging and joining of data sets
  - Flexible reshaping and pivoting of data sets
  - Hierarchical labeling of axes (possible to have multiple labels per tick)
  - Robust IO tools for loading data from flat files (CSV and delimited),
    Excel files, databases, and saving/loading data from the ultrafast HDF5
    format
  - Time series-specific functionality: date range generation and frequency
    conversion, moving window statistics, moving window linear regressions,
    date shifting and lagging
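
A few of the features listed above can be sketched on small made-up data: label alignment, missing-data handling, group by, and pivoting. All names and values here are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Automatic alignment: values are matched on index labels, not on position
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])
aligned_sum = a + b          # "x" has no partner in b, so the result there is NaN

# Missing data: reductions such as sum() skip NaN by default
total = aligned_sum.sum()

# Split-apply-combine: split by region, sum each group, combine the results
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "amount": [100, 150, 200, 50],
})
per_region = sales.groupby("region")["amount"].sum()

# Reshaping: pivot long data (one row per day and city) into a wide table
long_form = pd.DataFrame({
    "day": ["mon", "mon", "tue", "tue"],
    "city": ["NYC", "BOS", "NYC", "BOS"],
    "temp": [30, 28, 33, 29],
})
wide = long_form.pivot(index="day", columns="city", values="temp")

print(aligned_sum)
print(per_region)
print(wide)
```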

###  Dataset: Stanford Open Policing Project  ([video](https://www.youtube.com/watch?v=hl-TGI4550M&list=PL5-da3qGB5IBITZj_dYSFqnd_15JgqwA6&index=1))

https://openpolicing.stanford.edu/


In [None]:
# ri stands for Rhode Island
ri = pd.read_csv('police.csv')

In [None]:
# what does each row represent?
#ri
ri.head()

In [None]:
ri.shape

In [None]:
# join the date and time strings into one Series; it is needed to build stop_datetime in the next cell
combined = ri.stop_date.str.cat(ri.stop_time, sep=' ')
type(combined)

In [None]:
ri['stop_datetime'] = pd.to_datetime(combined)
ri.groupby(ri.stop_datetime.dt.hour).stop_date.count()
#ri.groupby(ri.stop_datetime.dt.hour).stop_date.count().plot()
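
The same steps can be tried without `police.csv`. The sketch below uses three made-up stops in a hypothetical frame named `stops`, with the column names `stop_date` and `stop_time` mirroring the ones used above.

```python
import pandas as pd

# Hypothetical stops, stored as strings the way they would arrive from a CSV file
stops = pd.DataFrame({
    "stop_date": ["2005-01-02", "2005-01-02", "2005-01-03"],
    "stop_time": ["01:55", "01:10", "14:30"],
})

# Join date and time into one string per row, then parse to datetime64
combined = stops.stop_date.str.cat(stops.stop_time, sep=" ")
stops["stop_datetime"] = pd.to_datetime(combined)

# Group by the hour of day and count the stops in each hour
per_hour = stops.groupby(stops.stop_datetime.dt.hour).stop_date.count()
print(per_hour)
```

Grouping by `dt.hour` pools stops from different days into the same hour bucket, which is the point when looking at time-of-day patterns.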

In [None]:
ri.shape

In [None]:
ri.head()