# 1. Pandas Intro

### Resources
+ [Pandas Official Documentation!](http://pandas.pydata.org/pandas-docs/stable/)

# Welcome to Pandas

### What is Pandas?
Pandas is one of the most widely used open source data exploration libraries for Python. It gives the user the power to explore, manipulate, query, aggregate, and visualize **tabular** data. Tabular data is two-dimensional data arranged in rows and columns, i.e. a table.


### NumPy
NumPy ('numerical Python') is the most popular third-party Python library for scientific computing and forms the foundation for dozens of others, including Pandas. NumPy's primary data structure is a fast n-dimensional array.

### Pandas is built directly on NumPy
You can think of Pandas as a higher-level, easier-to-use interface for data analysis than NumPy.
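
To make the relationship concrete, here is a minimal sketch that moves data from a NumPy array into a DataFrame and back. The array values and the column labels `a` and `b` are made up for illustration.

```python
import numpy as np
import pandas as pd

# A small 2-D NumPy array: 3 rows, 2 columns (made-up values)
arr = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Wrap the array in a DataFrame and label the columns
df = pd.DataFrame(arr, columns=['a', 'b'])

# Convert the DataFrame back to a plain NumPy array
values = df.to_numpy()

print(type(values).__name__)  # ndarray
print(values.shape)           # (3, 2)
print(df['a'].mean())         # 3.0
```

The DataFrame adds labels and convenience methods on top of the same kind of array data.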

# The DataFrame - the primary data structure in Pandas

The **DataFrame** is our two-dimensional data structure that looks like any other table of data you have seen with rows and columns.

![](images/Components of a DataFrame.png)


# Import Pandas and read in data
By convention, Pandas is imported and aliased as **`pd`**. We will read in the **`bikes`** dataset with the **`read_csv`** function. Its first parameter is the location of the file, given here relative to the current directory.


In [2]:
import pandas as pd
bikes = pd.read_csv('data/bikes.csv')
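
If **`data/bikes.csv`** is not at hand, the behavior of **`read_csv`** can still be tried out on in-memory text. The sketch below assumes a hypothetical three-column excerpt (`csv_text`) and wraps it in `io.StringIO`, since **`read_csv`** also accepts file-like objects.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file on disk (three columns, three rows)
csv_text = """trip_id,usertype,tripduration
7147,Subscriber,993
7524,Subscriber,623
10927,Subscriber,1040
"""

# read_csv accepts a path or a file-like object;
# by default the first line supplies the column names
sample = pd.read_csv(io.StringIO(csv_text))

print(list(sample.columns))  # ['trip_id', 'usertype', 'tripduration']
print(len(sample))           # 3
```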

## Display DataFrame in Jupyter Notebook

We assigned our DataFrame to the **`bikes`** variable. Let's output just the first 5 rows with the **`head`** method.

In [3]:
bikes.head()

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
0,7147,Subscriber,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,41.88105,-87.61697,11.0,Michigan Ave & Oak St,41.90096,-87.623777,15.0,73.9,10.0,12.7,-9999.0,mostlycloudy
1,7524,Subscriber,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,41.88338,-87.64117,31.0,Wells St & Walton St,41.89993,-87.63443,19.0,69.1,10.0,6.9,-9999.0,partlycloudy
2,10927,Subscriber,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,41.909592,-87.653497,15.0,Dearborn St & Monroe St,41.88132,-87.629521,23.0,73.0,10.0,16.1,-9999.0,mostlycloudy
3,12907,Subscriber,Male,2013-07-01 10:05:00,2013-07-01 10:16:00,667,Carpenter St & Huron St,41.894556,-87.653449,19.0,Clark St & Randolph St,41.884576,-87.63189,31.0,72.0,10.0,16.1,-9999.0,mostlycloudy
4,13168,Subscriber,Male,2013-07-01 11:16:00,2013-07-01 11:18:00,130,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,73.0,10.0,17.3,-9999.0,partlycloudy


## First and Last `n` rows
Both the **`head`** and **`tail`** methods take a single parameter **`n`**, which controls the number of rows to return (5 by default):

In [4]:
bikes.tail(8)

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
50081,17534131,Subscriber,Male,2017-12-29 16:09:00,2017-12-29 16:19:00,617,Kingsbury St & Erie St,41.893882,-87.641711,23.0,Canal St & Adams St,41.879255,-87.639904,47.0,14.0,1.5,6.9,0.0,snow
50082,17534773,Subscriber,Male,2017-12-30 10:47:00,2017-12-30 10:53:00,363,Larrabee St & Oak St,41.900219,-87.642985,19.0,Halsted St & Blackhawk St (*),41.908537,-87.648627,20.0,3.9,10.0,13.8,-9999.0,mostlycloudy
50083,17534831,Subscriber,Male,2017-12-30 11:36:00,2017-12-30 11:55:00,1175,Western Ave & Walton St,41.898418,-87.686596,19.0,Damen Ave & Clybourn Ave,41.931931,-87.677856,15.0,3.9,10.0,13.8,-9999.0,partlycloudy
50084,17534938,Subscriber,Male,2017-12-30 13:07:00,2017-12-30 13:34:00,1625,State St & Pearson St,41.897448,-87.628722,27.0,Clark St & Elm St,41.902973,-87.63128,27.0,5.0,10.0,16.1,-9999.0,partlycloudy
50085,17534969,Subscriber,Male,2017-12-30 13:34:00,2017-12-30 13:44:00,585,Halsted St & 35th St (*),41.830661,-87.647172,16.0,Union Ave & Root St,41.819102,-87.643278,11.0,5.0,10.0,16.1,-9999.0,partlycloudy
50086,17534972,Subscriber,Male,2017-12-30 13:34:00,2017-12-30 13:48:00,824,Kingsbury St & Kinzie St,41.889177,-87.638506,31.0,Halsted St & Blackhawk St (*),41.908537,-87.648627,20.0,5.0,10.0,16.1,-9999.0,partlycloudy
50087,17535645,Subscriber,Female,2017-12-31 09:30:00,2017-12-31 09:33:00,178,Clinton St & Lake St,41.885637,-87.641823,23.0,Kingsbury St & Kinzie St,41.889177,-87.638506,31.0,7.0,10.0,11.5,-9999.0,partlycloudy
50088,17536246,Subscriber,Male,2017-12-31 15:22:00,2017-12-31 15:26:00,214,Clarendon Ave & Leland Ave,41.967968,-87.650001,15.0,Clifton Ave & Lawrence Ave,41.968812,-87.657659,15.0,10.9,10.0,15.0,-9999.0,partlycloudy
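
As a quick check of how **`n`** behaves, here is a small sketch on a made-up 10-row DataFrame (the column `x` is hypothetical):

```python
import pandas as pd

# Made-up DataFrame with 10 rows: x = 0, 1, ..., 9
df = pd.DataFrame({'x': range(10)})

first_default = df.head()  # n defaults to 5
first_two = df.head(2)     # first 2 rows
last_three = df.tail(3)    # last 3 rows

print(len(first_default))        # 5
print(first_two['x'].tolist())   # [0, 1]
print(last_three['x'].tolist())  # [7, 8, 9]
```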


# What type of object is `bikes`?
As we said previously, **`bikes`** is a DataFrame. Let's verify this. In the output, only the name after the last dot is the class name; everything before it is the module path.

In [5]:
type(bikes)

pandas.core.frame.DataFrame
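
The class name can also be read directly from the type object. This is a small sketch on made-up data; the module path it prints may differ between Pandas versions, so no output is shown for that line.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})  # made-up data

cls = type(df)
print(cls.__module__)                # module path (varies by Pandas version)
print(cls.__name__)                  # DataFrame
print(isinstance(df, pd.DataFrame))  # True
```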

# Select a single column from a DataFrame - a Series
To select a single column from a DataFrame, pass the name of one of the columns to the indexing operator, **`[]`**. The returned object will be a Pandas Series. Let's choose the column name **`tripduration`**, assign it to a variable, and output it to the screen.

In [6]:
trip_duration = bikes['tripduration']
trip_duration.head()

0     993
1     623
2    1040
3     667
4     130
Name: tripduration, dtype: int64
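
The same selection can be sketched on a tiny made-up DataFrame to confirm that the result is a Series that keeps the column name:

```python
import pandas as pd

# Made-up two-column DataFrame borrowing column names from bikes
df = pd.DataFrame({
    'tripduration': [993, 623, 1040],
    'gender': ['Male', 'Male', 'Male'],
})

# The indexing operator with a single column name returns a Series
duration = df['tripduration']

print(type(duration).__name__)  # Series
print(duration.name)            # tripduration
print(duration.tolist())        # [993, 623, 1040]
```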

# Components of a Series - Index and Data
A Series is simpler than a DataFrame, with just a single dimension of data. It has two components - the **index** and the **data**. It is essentially a one-column DataFrame. Let's take a look at a stylized Series graphic.

![](images/Components of a Series.png)
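
The two components can also be inspected in code. This is a minimal sketch with a hand-built Series; the values are the first three trip durations shown earlier.

```python
import pandas as pd

s = pd.Series([993, 623, 1040], name='tripduration')

# The index holds the row labels (0, 1, 2 by default)
print(list(s.index))          # [0, 1, 2]

# The data is available as a NumPy array
print(s.to_numpy().tolist())  # [993, 623, 1040]
```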

# Data Types
Each column of data in a Pandas DataFrame has a **data type**. This is a very similar concept to types in Python. Just like every object has a type, every column has a data type. Every value in each column must be of the same data type.

## Most Common Data Types
The following are the data types that appear most frequently in DataFrames.

* **Boolean** - only two values: **`True`** and **`False`**
* **Integer** - whole numbers without a decimal
* **Float** - numbers with decimals
* **Object** - mainly strings
* **DateTime** - a specific moment in time
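
The list above can be sketched as a small DataFrame with one column per data type. The column `is_member` is hypothetical; the others borrow names from **`bikes`**. Because the exact dtype labels can vary between Pandas versions, the checks use the helper functions in `pandas.api.types` rather than comparing dtype names.

```python
import pandas as pd
from pandas.api import types as ptypes

df = pd.DataFrame({
    'is_member': [True, False, True],            # Boolean (hypothetical column)
    'tripduration': [993, 623, 1040],            # Integer
    'temperature': [73.9, 69.1, 73.0],           # Float
    'events': ['mostlycloudy', 'partlycloudy', 'mostlycloudy'],  # strings
    'starttime': pd.to_datetime([                # DateTime
        '2013-06-28 19:01:00',
        '2013-06-28 22:53:00',
        '2013-06-30 14:43:00',
    ]),
})

print(df.dtypes)

print(ptypes.is_bool_dtype(df['is_member']))              # True
print(ptypes.is_integer_dtype(df['tripduration']))        # True
print(ptypes.is_float_dtype(df['temperature']))           # True
print(ptypes.is_string_dtype(df['events']))               # True
print(ptypes.is_datetime64_any_dtype(df['starttime']))    # True
```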


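# Missing Value Representation, `NaN`, `None`, and `NaT`
Pandas represents missing values differently depending on the data type of the column: **`NaN`** in float columns, **`NaT`** in DateTime columns, and **`None`** or **`NaN`** in object columns. The default integer and boolean types cannot hold a missing value, so such columns are converted to another type when a value is missing.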
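
A minimal sketch of how missing values show up for a few column types, using made-up Series. The **`isna`** method detects all of these representations.

```python
import numpy as np
import pandas as pd

# Float column: the missing value is stored as NaN
floats = pd.Series([1.5, None, 3.0])
print(floats.dtype)                # float64
print(np.isnan(floats.iloc[1]))    # True

# Integers with a missing value are converted to float
ints = pd.Series([1, None, 3])
print(ints.dtype)                  # float64

# DateTime column: the missing value is NaT
dates = pd.to_datetime(pd.Series(['2017-12-31', None]))
print(dates.iloc[1] is pd.NaT)     # True

# Object column (requested explicitly here): None is kept as-is
objects = pd.Series(['snow', None], dtype=object)
print(objects.iloc[1] is None)     # True

# isna flags every kind of missing value
print(floats.isna().tolist())      # [False, True, False]
print(dates.isna().tolist())       # [False, True]
print(objects.isna().tolist())     # [False, True]
```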

# Finding the data types of each column
The **`dtypes`** DataFrame attribute returns the data type of each column. Let's get the data types of our **`bikes`** DataFrame.


In [7]:
bikes.dtypes

trip_id                int64
usertype              object
gender                object
starttime             object
stoptime              object
tripduration           int64
from_station_name     object
latitude_start       float64
longitude_start      float64
dpcapacity_start     float64
to_station_name       object
latitude_end         float64
longitude_end        float64
dpcapacity_end       float64
temperature          float64
visibility           float64
wind_speed           float64
precipitation        float64
events                object
dtype: object

# Think string whenever you see object
Pandas has traditionally stored strings in columns of the **object** data type rather than in a dedicated string type like most databases have. When you see **object**, you should assume that the column consists entirely of strings. While this won't always hold true, in this class and in the vast majority of cases it will.
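
To see why the assumption is not guaranteed, here is a small sketch of an object Series holding values of different Python types. The values are made up.

```python
import pandas as pd

# An object column can hold any Python object, not only strings
mixed = pd.Series(['snow', 42, 3.5])

print(mixed.dtype)                         # object
print([type(v).__name__ for v in mixed])   # ['str', 'int', 'float']
```

Each element keeps its own Python type, which is why object columns make no promise about their contents.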

# Why are `starttime` and `stoptime` object data types?
By default, Pandas reads in columns that look like DateTimes as strings. Use the **`parse_dates`** parameter of **`read_csv`** to tell it which columns should be converted to DateTimes.

In [8]:
bikes = pd.read_csv('data/bikes.csv', parse_dates=['starttime', 'stoptime'])
bikes.dtypes

trip_id                       int64
usertype                     object
gender                       object
starttime            datetime64[ns]
stoptime             datetime64[ns]
tripduration                  int64
from_station_name            object
latitude_start              float64
longitude_start             float64
dpcapacity_start            float64
to_station_name              object
latitude_end                float64
longitude_end               float64
dpcapacity_end              float64
temperature                 float64
visibility                  float64
wind_speed                  float64
precipitation               float64
events                       object
dtype: object
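
The effect of **`parse_dates`** can be sketched on in-memory text. The hypothetical `csv_text` below holds two rows in the same format as **`bikes`**. The sketch also shows **`pd.to_datetime`** as one way to convert a column after reading, and one payoff of DateTime columns: they support arithmetic.

```python
import io
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Hypothetical two-row excerpt in the same format as bikes.csv
csv_text = """trip_id,starttime,stoptime
7147,2013-06-28 19:01:00,2013-06-28 19:17:00
7524,2013-06-28 22:53:00,2013-06-28 23:03:00
"""

# Without parse_dates the time columns are read as strings
raw = pd.read_csv(io.StringIO(csv_text))
print(is_datetime64_any_dtype(raw['starttime']))     # False

# With parse_dates they become DateTime columns
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['starttime', 'stoptime'])
print(is_datetime64_any_dtype(parsed['starttime']))  # True

# Converting after the fact is another option
converted = pd.to_datetime(raw['starttime'])

# DateTime columns can be subtracted to get elapsed time
elapsed = parsed['stoptime'] - parsed['starttime']
print(elapsed.dt.total_seconds().tolist())           # [960.0, 600.0]
```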

# Getting more Metadata
Metadata is data about the data. The data type of each column is an example of **metadata**. The number of rows and columns is another piece of metadata. We find this with the **`shape`** attribute:

In [9]:
bikes.shape

(50089, 19)
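
Since **`shape`** is a plain tuple of (rows, columns), it can be unpacked directly. A small sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})  # made-up data

# shape is a tuple: (number of rows, number of columns)
n_rows, n_cols = df.shape
print(n_rows, n_cols)  # 3 2

# len gives the number of rows only
print(len(df))         # 3

# A Series has a one-element shape
print(df['a'].shape)   # (3,)
```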

### Get Data Types plus more with `info` method
The **`info`** DataFrame method prints output similar to **`dtypes`** but also shows the number of non-missing values in each column.

In [10]:
bikes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50089 entries, 0 to 50088
Data columns (total 19 columns):
trip_id              50089 non-null int64
usertype             50089 non-null object
gender               50089 non-null object
starttime            50089 non-null datetime64[ns]
stoptime             50089 non-null datetime64[ns]
tripduration         50089 non-null int64
from_station_name    50089 non-null object
latitude_start       50083 non-null float64
longitude_start      50083 non-null float64
dpcapacity_start     50083 non-null float64
to_station_name      50089 non-null object
latitude_end         50077 non-null float64
longitude_end        50077 non-null float64
dpcapacity_end       50077 non-null float64
temperature          50089 non-null float64
visibility           50089 non-null float64
wind_speed           50089 non-null float64
precipitation        50089 non-null float64
events               50089 non-null object
dtypes: datetime64[ns](2), float64(10), int64(2), object(5)
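
Because **`info`** prints its summary instead of returning it, the non-missing counts are easier to work with through other methods. The sketch below uses a made-up two-column DataFrame with one missing latitude and relies on **`count`** and **`isna`**, which return their results as Series.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing latitude
df = pd.DataFrame({
    'latitude_start': [41.88105, np.nan, 41.909592],
    'events': ['mostlycloudy', 'partlycloudy', 'mostlycloudy'],
})

# info prints a summary and returns None
result = df.info()
print(result is None)                  # True

# count returns the number of non-missing values per column
non_null = df.count()
print(int(non_null['latitude_start']))  # 2
print(int(non_null['events']))          # 3

# isna followed by sum gives the number of missing values per column
missing = df.isna().sum()
print(int(missing['latitude_start']))   # 1
```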

# Exercises
Use the **`bikes`** DataFrame for the following:

### Problem 1
<span  style="color:green; font-size:16px">Select the column **`events`**, the type of weather that was recorded, and assign it to a variable with the same name. Output its first 10 values.</span>

In [13]:
events = bikes['events']
events.head(10)

0    mostlycloudy
1    partlycloudy
2    mostlycloudy
3    mostlycloudy
4    partlycloudy
5    mostlycloudy
6          cloudy
7          cloudy
8          cloudy
9    mostlycloudy
Name: events, dtype: object

### Problem 2
<span  style="color:green; font-size:16px">What type of object is **`events`**?</span>

In [15]:
type(events)

pandas.core.series.Series

### Problem 3
<span  style="color:green; font-size:16px">Select the last 2 rows of the **`bikes`** DataFrame and assign them to the variable **`bikes_last_2`**. What type of object is **`bikes_last_2`**?</span>

In [17]:
bikes_last_2 = bikes.tail(2)
type(bikes_last_2)

pandas.core.frame.DataFrame

### Problem 4
<span  style="color:green; font-size:16px">What type of object is returned from the **`dtypes`** attribute?</span>

In [18]:
bikes_last_2.dtypes

trip_id                       int64
usertype                     object
gender                       object
starttime            datetime64[ns]
stoptime             datetime64[ns]
tripduration                  int64
from_station_name            object
latitude_start              float64
longitude_start             float64
dpcapacity_start            float64
to_station_name              object
latitude_end                float64
longitude_end               float64
dpcapacity_end              float64
temperature                 float64
visibility                  float64
wind_speed                  float64
precipitation               float64
events                       object
dtype: object