In [34]:
import pandas as pd

## Introduction to Pandas

**pandas** is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming **the most powerful and flexible open source data analysis / manipulation tool available in any language**. It is already well on its way toward this goal.

pandas is well suited for many different kinds of data:
* Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
* Ordered and unordered (not necessarily fixed-frequency) time series data.
* Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
* Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

The two primary data structures of pandas, *Series* (1-dimensional) and *DataFrame* (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For R users, DataFrame provides everything that R’s data.frame provides and much more. pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other third-party libraries.
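As a minimal sketch of how the two structures relate, the cell below builds a Series and then a DataFrame that contains it. The names `prices` and `table` and the values are made up purely for illustration.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
prices = pd.Series([10.5, 11.0, 9.75], index=["mon", "tue", "wed"], name="price")

# A DataFrame is a two-dimensional table; each of its columns is a Series.
# The plain list is aligned to the Series index by position.
table = pd.DataFrame({"price": prices, "volume": [100, 150, 90]})

print(prices["tue"])         # look up a value by its label
print(type(table["price"]))  # selecting a single column gives back a Series
```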

Here are just a few of the things that pandas does well:
* Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
* Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
* Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
* Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
* Intuitive merging and joining data sets
* Flexible reshaping and pivoting of data sets
* Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast HDF5 format
* Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.
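Two of the points above, missing-data handling and split-apply-combine, can be sketched in a few lines. The `readings` table and its values are hypothetical; the calls used (`groupby`, `mean`, `transform`, `fillna`) are standard pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with one missing temperature
readings = pd.DataFrame({
    "city": ["Dublin", "Dublin", "Limerick", "Limerick"],
    "temp": [12.0, np.nan, 9.0, 11.0],
})

# Aggregating: missing values are skipped by default
mean_by_city = readings.groupby("city")["temp"].mean()

# Transforming: compute each city's mean, broadcast it back to every row,
# and use it to fill the gap
city_mean = readings.groupby("city")["temp"].transform("mean")
filled = readings["temp"].fillna(city_mean)

print(mean_by_city)
print(filled)
```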

Many of these principles are here to address the shortcomings frequently experienced using other languages / scientific research environments. For data scientists, working with data is typically divided into multiple stages: munging and cleaning data, analyzing / modeling it, then organizing the results of the analysis into a form suitable for plotting or tabular display. pandas is the ideal tool for all of these tasks.

Some other notes:
* **pandas** is fast. Many of the low-level algorithmic bits have been extensively tweaked in Cython code. However, as with anything else, generalization usually sacrifices performance. So if you focus on one feature for your application, you may be able to create a faster specialized tool.
* pandas is a dependency of statsmodels, making it an important part of the statistical computing ecosystem in Python.
* pandas has been used extensively in production in financial applications.

<br>

## Representation

All data is loaded directly into the RAM and is optimised to use memory efficiently. The data in memory can be thought of as below:

| column1 | column2 | Gender |
| -------- | -------- | -------- |
| Allen | Varghese | Male |
| Kevin | O'Brien | Male |
| Mihai | Todor | Male |


The main data structure in **pandas** is the **DataFrame**, which manages data in the above format. Its columns are accessed by name, much like the keys of a Python **dictionary**. Let's create the above information as a pandas DataFrame.

In [35]:
# Create a python dictionary
data = {
    "column1": ["Allen", "Kevin", "Mihai"],
    "column2": ["Varghese", "O'Brien", "Todor"],
    "Gender": ["Male", "Male", "Male"],
    "some_random_numbers": [4200, 2750, 3820]
}

# Create the DataFrame
df = pd.DataFrame(data)
df

| | column1 | column2 | Gender | some_random_numbers |
| -------- | -------- | -------- | -------- | -------- |
| 0 | Allen | Varghese | Male | 4200 |
| 1 | Kevin | O'Brien | Male | 2750 |
| 2 | Mihai | Todor | Male | 3820 |


pandas adds an `index` automatically (here the integers 0 to 2). The index labels each row, which makes lookups and slicing fast. Individual rows can be accessed by their index label using `.loc`.

In [36]:
df.loc[1]

column1                  Kevin
column2                O'Brien
Gender                    Male
some_random_numbers       2750
Name: 1, dtype: object
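With the default index, labels and positions coincide, so the difference between label-based and position-based access is easy to miss. The sketch below uses a hypothetical table `people` with non-default labels to show `.loc` (by label) next to `.iloc` (by position).

```python
import pandas as pd

# Hypothetical table whose index labels are not 0, 1, 2
people = pd.DataFrame(
    {"name": ["Allen", "Kevin", "Mihai"]},
    index=[10, 20, 30],
)

by_label = people.loc[20, "name"]   # row with label 20
by_position = people.iloc[1, 0]     # second row, first column

# Label slices include the end label; positional slices exclude the end position
label_slice = people.loc[10:20]
position_slice = people.iloc[0:1]

print(by_label, by_position, len(label_slice), len(position_slice))
```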

In [37]:
df["column1"]

0    Allen
1    Kevin
2    Mihai
Name: column1, dtype: object

In [38]:
# Convert to a NumPy array (`.to_numpy()` is the equivalent method form)
df["column1"].values

array(['Allen', 'Kevin', 'Mihai'], dtype=object)

In [39]:
# Convert to a Python list
df["column1"].tolist()

['Allen', 'Kevin', 'Mihai']

**NOTE:** Data in a pandas DataFrame is stored as a collection of columns rather than a collection of rows. Accessing and manipulating whole columns is therefore much faster than working row by row. It is important to use appropriate data modelling techniques to convert the available data into a format that suits column-wise access.
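To make the note above concrete, the sketch below computes the same result column-wise and row-wise. The table `df_demo` and its values are made up; on large tables the column-wise form is typically much faster, although no timings are shown here.

```python
import pandas as pd

# Hypothetical order lines
df_demo = pd.DataFrame({"price": [4.0, 2.5, 3.0], "qty": [2, 4, 1]})

# Column-wise: one vectorized operation over whole columns
df_demo["total"] = df_demo["price"] * df_demo["qty"]

# Row-wise: a Python-level loop over rows, shown only for comparison
totals_by_row = [row.price * row.qty for row in df_demo.itertuples()]

print(df_demo["total"].tolist() == totals_by_row)
```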

<br>

## Inspecting DataFrames

Below are some useful DataFrame inspection functions:

In [40]:
# Summary of the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   column1              3 non-null      object
 1   column2              3 non-null      object
 2   Gender               3 non-null      object
 3   some_random_numbers  3 non-null      int64 
dtypes: int64(1), object(3)
memory usage: 224.0+ bytes


In [41]:
# Get the columns
df.columns

Index(['column1', 'column2', 'Gender', 'some_random_numbers'], dtype='object')

In [42]:
# Get the columns as a Python list
df.columns.tolist()

['column1', 'column2', 'Gender', 'some_random_numbers']

In [43]:
# You can also use the Python function `list`
list(df.columns)

['column1', 'column2', 'Gender', 'some_random_numbers']

In [44]:
# Get the index
df.index

RangeIndex(start=0, stop=3, step=1)

In [45]:
# Sneak peek. Displays the first 5 rows by default
df.head()

| | column1 | column2 | Gender | some_random_numbers |
| -------- | -------- | -------- | -------- | -------- |
| 0 | Allen | Varghese | Male | 4200 |
| 1 | Kevin | O'Brien | Male | 2750 |
| 2 | Mihai | Todor | Male | 3820 |


In [46]:
# Summary Statistics
df.describe()

| | some_random_numbers |
| -------- | -------- |
| count | 3.0 |
| mean | 3590.0 |
| std | 751.864349 |
| min | 2750.0 |
| 25% | 3285.0 |
| 50% | 3820.0 |
| 75% | 4010.0 |
| max | 4200.0 |


In [47]:
# Extract a specific row
df.describe().loc["25%"]

some_random_numbers    3285.0
Name: 25%, dtype: float64

In [48]:
# Display the first row
df.head(1)

| | column1 | column2 | Gender | some_random_numbers |
| -------- | -------- | -------- | -------- | -------- |
| 0 | Allen | Varghese | Male | 4200 |


In [49]:
# Display the last row
df.tail(1)

| | column1 | column2 | Gender | some_random_numbers |
| -------- | -------- | -------- | -------- | -------- |
| 2 | Mihai | Todor | Male | 3820 |


In [50]:
# Number of rows in a DataFrame
len(df)

3

To look up details about a DataFrame or a function, use `dir` or `help`:

In [None]:
dir(df)

In [None]:
help(df.loc)

<br>

## Manipulating Data

Columns can be added to and removed from a DataFrame on the fly, which makes data manipulation very easy.

In [51]:
df

| | column1 | column2 | Gender | some_random_numbers |
| -------- | -------- | -------- | -------- | -------- |
| 0 | Allen | Varghese | Male | 4200 |
| 1 | Kevin | O'Brien | Male | 2750 |
| 2 | Mihai | Todor | Male | 3820 |


In [52]:
# Add a new column by assigning a Python list
# with the same length as the DataFrame
df["Location"] = ["Dublin", "Limerick", "Dublin"]
df

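| | column1 | column2 | Gender | some_random_numbers | Location |
| -------- | -------- | -------- | -------- | -------- | -------- |
| 0 | Allen | Varghese | Male | 4200 | Dublin |
| 1 | Kevin | O'Brien | Male | 2750 | Limerick |
| 2 | Mihai | Todor | Male | 3820 | Dublin |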


In [53]:
# A subset of columns can be extracted from a DataFrame to remove extra columns
name_df = df[["column1", "column2"]]
name_df

| | column1 | column2 |
| -------- | -------- | -------- |
| 0 | Allen | Varghese |
| 1 | Kevin | O'Brien |
| 2 | Mihai | Todor |


In [54]:
# Rename columns
name_df = name_df.rename(
    columns={
        "column1": "First Name",
        "column2": "Last Name"
    }
)
name_df

| | First Name | Last Name |
| -------- | -------- | -------- |
| 0 | Allen | Varghese |
| 1 | Kevin | O'Brien |
| 2 | Mihai | Todor |


In [55]:
# Extract the first letter of the first name into a new column "FN_1"
name_df["FN_1"] = name_df["First Name"].map(lambda x: x[0])
name_df

| | First Name | Last Name | FN_1 |
| -------- | -------- | -------- | -------- |
| 0 | Allen | Varghese | A |
| 1 | Kevin | O'Brien | K |
| 2 | Mihai | Todor | M |


In [56]:
# TODO: Extract the last 3 letters of Last Name in a new column "LN_3"
name_df["LN_3"] = name_df["Last Name"].map(lambda x: x[-3:])
name_df

| | First Name | Last Name | FN_1 | LN_3 |
| -------- | -------- | -------- | -------- | -------- |
| 0 | Allen | Varghese | A | ese |
| 1 | Kevin | O'Brien | K | ien |
| 2 | Mihai | Todor | M | dor |
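The same extractions can also be written with the `.str` accessor instead of `map` with a lambda. This is an alternative sketch, not the approach used in the cells above. The table `names` is hypothetical and includes a missing first name to show one practical difference: `.str` passes missing values through, whereas `lambda x: x[0]` would raise a `TypeError` on `None`.

```python
import pandas as pd

# Hypothetical names, with one missing first name
names = pd.DataFrame({
    "First Name": ["Allen", "Kevin", None],
    "Last Name": ["Varghese", "O'Brien", "Todor"],
})

# .str applies string indexing and slicing element by element
names["FN_1"] = names["First Name"].str[0]
names["LN_3"] = names["Last Name"].str[-3:]

print(names)
```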


<br>

## Load Data

Data can be loaded from external sources such as CSV files, Excel files, and databases.

In [None]:
csv_df = pd.read_csv("datasets/weather_2012.csv")
csv_df.info()

In [None]:
csv_df.head()

In [None]:
csv_df["Weather"].unique()
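If `datasets/weather_2012.csv` is not at hand, the same calls can be tried on in-memory text. The three rows below are invented for illustration and are not taken from the real file. The column names `Temp (C)` and `Weather` match those used in this notebook, while the `Date/Time` column name is an assumption.

```python
import io
import pandas as pd

# Invented sample rows, NOT the contents of weather_2012.csv
csv_text = """Date/Time,Temp (C),Weather
2012-01-01 00:00:00,-1.8,Fog
2012-01-01 01:00:00,-1.8,Fog
2012-01-01 02:00:00,-1.5,Rain
"""

# read_csv accepts any file-like object, not only a path.
# parse_dates converts the column to datetimes; index_col makes it the index.
sample = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["Date/Time"],
    index_col="Date/Time",
)

sample.info()
print(sample["Weather"].unique())
```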

<br>

## Filtering data

In [None]:
csv_df["Weather"].str.contains("Rain")

In [None]:
csv_df["Weather"][csv_df["Weather"].str.contains("Rain")]

In [None]:
csv_df["Weather"][csv_df["Weather"].str.contains("Rain")].head(1)

In [None]:
# List the distinct weather values that contain "Rain"
csv_df["Weather"][csv_df["Weather"].str.contains("Rain")].unique()

In [None]:
# Find rows where the weather value is exactly "Fog"
csv_df[csv_df["Weather"] == "Fog"]

In [None]:
# Find rows where the temperature is above 10 deg C and the weather is "Cloudy" or "Clear"
csv_df[(csv_df["Temp (C)"] > 10.0) & (csv_df["Weather"].isin(["Cloudy", "Clear"]))]
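Filters like the one above are easier to read when each condition is given a name first. The sketch below uses a small invented table `weather` to show how boolean masks combine with `&` (and), `|` (or), and `~` (not). Each comparison needs its own parentheses when written inline, because these operators bind more tightly than `>` and `==`.

```python
import pandas as pd

# Invented rows for illustration
weather = pd.DataFrame({
    "Temp (C)": [12.0, 8.0, 15.0, 11.0],
    "Weather": ["Clear", "Cloudy", "Rain", "Cloudy"],
})

# Each condition is a boolean Series with one True/False per row
is_warm = weather["Temp (C)"] > 10.0
is_dry = weather["Weather"].isin(["Cloudy", "Clear"])

warm_and_dry = weather[is_warm & is_dry]   # both conditions hold
warm_or_dry = weather[is_warm | is_dry]    # at least one holds
not_dry = weather[~is_dry]                 # negation

print(len(warm_and_dry), len(warm_or_dry), len(not_dry))
```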

In [None]:
# Group by on "Weather" column and count the number of records for each category
weather_grpby = csv_df.groupby("Weather")["Dew Point Temp (C)"].count()
weather_grpby

In [None]:
# Same count, but as_index=False keeps "Weather" as a regular column instead of the index
weather_grpby = csv_df.groupby("Weather", as_index=False)["Dew Point Temp (C)"].count()
weather_grpby
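One detail worth knowing about the two cells above: `count()` counts non-null values in the selected column, so it equals the number of records only when that column has no missing values. The sketch below, on an invented table `obs`, contrasts it with `size()`, which counts rows regardless of missing values.

```python
import numpy as np
import pandas as pd

# Invented observations with one missing dew point
obs = pd.DataFrame({
    "Weather": ["Fog", "Fog", "Rain"],
    "Dew Point Temp (C)": [1.0, np.nan, 3.0],
})

non_null = obs.groupby("Weather")["Dew Point Temp (C)"].count()  # skips NaN
rows = obs.groupby("Weather").size()                             # counts every row

# as_index=False returns the group key as a column of a DataFrame
flat = obs.groupby("Weather", as_index=False)["Dew Point Temp (C)"].count()

print(non_null["Fog"], rows["Fog"], list(flat.columns))
```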

In [None]:
# Plot a line graph for "Temp (C)" field

# Plot function by default creates a "line" graph.
# The figure size is set for (width, height) in inches
csv_df["Temp (C)"].plot(figsize=(17, 8))

In [None]:
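# Use the groupby result from earlier to find the weather with the highest number of records.
# With as_index=False, "Weather" is a regular column, so name it as the x-axis explicitly.
weather_grpby.plot(kind="bar", x="Weather", y="Dew Point Temp (C)", figsize=(17, 8))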

In [None]:
bikes = pd.read_csv(
    'datasets/bikes.csv', sep=';', encoding='latin1',
    parse_dates=['Date'], dayfirst=True, index_col='Date'
)
bikes.info()

In [None]:
bikes.head()

In [None]:
bikes["Berri 1"].plot(figsize=(17, 9))

In [None]:
# Make a copy of only the "Berri 1" data
berri_bikes = bikes[['Berri 1']].copy()
berri_bikes.head()

Next, we need to add a 'weekday' column. We can get the weekday from the index, which holds the date of each row. Pandas has a bunch of really great time series functionality, so if we wanted to get the day of the month for each row, we could do it like this:

In [None]:
berri_bikes.index.day

We actually want the weekday, though:

In [None]:
berri_bikes.index.weekday

These are the days of the week, where 0 is Monday. I found out that 0 was Monday by checking on a calendar.
Now that we know how to get the weekday, we can add it as a column in our dataframe like this:

In [None]:
berri_bikes["weekday"] = berri_bikes.index.weekday
berri_bikes.head()

Next, we add up the cyclists by weekday:

In [None]:
weekday_counts = berri_bikes.groupby('weekday').sum()
weekday_counts

It's hard to remember what 0, 1, 2, 3, 4, 5, 6 mean, so we can fix it up and graph it:

In [None]:
weekday_counts.index = [
    'Monday', 'Tuesday', 'Wednesday', 'Thursday',
    'Friday', 'Saturday', 'Sunday'
]
weekday_counts
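Assigning a list to the index works when all seven weekdays are present and in order. A slightly more defensive sketch maps each weekday number to its name, which also works when some days are missing. The table `rides` below is invented, with two weeks of made-up counts starting on a Monday.

```python
import pandas as pd

# Invented daily counts for two weeks; 2012-01-02 was a Monday
dates = pd.date_range("2012-01-02", periods=14, freq="D")
rides = pd.DataFrame({"count": range(14)}, index=dates)

rides["weekday"] = rides.index.weekday  # 0 = Monday ... 6 = Sunday
by_weekday = rides.groupby("weekday")["count"].sum()

# Map numbers to names instead of overwriting the index with a fixed list
day_names = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
by_weekday = by_weekday.rename(index=dict(enumerate(day_names)))

# pandas can also produce the names directly from the dates
first_day_name = rides.index.day_name()[0]

print(by_weekday)
```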

In [None]:
weekday_counts.plot(kind='bar', figsize=(12, 9))

In [None]:
# TODO: Repeat the above exercise with Month

In [None]:
berri_bikes.index.month

In [None]:
berri_bikes["month"] = berri_bikes.index.month
berri_bikes

In [None]:
month_counts = berri_bikes.groupby("month").sum()
month_counts

In [None]:
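# The list must have exactly one name per month present in the data.
# Eleven names are given here, which assumes the data runs from January to November.
month_counts.index = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November"
]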
month_counts

In [None]:
month_counts["Berri 1"].plot(kind='bar', figsize=(10, 6))

<br>

## Persistence

Data in a DataFrame can be saved to a file or a database. Let's look at both scenarios.

### Save to File

In [None]:
weekday_counts

In [None]:
# Save all the data, including the index (written as the first column)
weekday_counts.to_csv("datasets/bike_travel_weekday_count.csv")

In [None]:
df

In [None]:
# Save DataFrame without index
df.to_csv("sample_dataset.csv", index=False)
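Whether the index is written matters when the file is read back. The sketch below round-trips a small invented table `counts` through CSV text in memory: reading it back without `index_col` turns the saved index into an ordinary column named `Unnamed: 0`, while `index_col=0` restores it as the index.

```python
import io
import pandas as pd

# Invented counts with a meaningful index
counts = pd.DataFrame({"Berri 1": [10, 20]}, index=["Monday", "Tuesday"])

# With no path given, to_csv returns the CSV text as a string
text = counts.to_csv()

# Read back naively: the saved index becomes a column called "Unnamed: 0"
naive = pd.read_csv(io.StringIO(text))

# Read back with index_col=0: the first column is used as the index again
restored = pd.read_csv(io.StringIO(text), index_col=0)

print(list(naive.columns))
print(list(restored.index))
```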

### Save to Database

In [None]:
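import sqlite3

# Open (or create) a local SQLite database file
db_conn = sqlite3.connect("workshop_db.sqlite")

# Write the DataFrame to a table, replacing the table if it already exists
df.to_sql("person_details", db_conn, if_exists="replace", index=False)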

In [None]:
df_table = pd.read_sql("select * from person_details", db_conn)
df_table
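The cells above leave the connection open. A minimal sketch of the same round trip that closes the connection when done, and passes a filter value as a query parameter, might look like this. It uses an in-memory database and an invented table `people` so that nothing is written to disk; the `score` column is hypothetical.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Invented data for illustration
people = pd.DataFrame({"name": ["Allen", "Kevin"], "score": [4200, 2750]})

# closing() guarantees conn.close() is called, even if an error occurs
with closing(sqlite3.connect(":memory:")) as conn:
    people.to_sql("person_details", conn, if_exists="replace", index=False)

    # "?" is SQLite's placeholder; the value is supplied through params
    high = pd.read_sql(
        "select name from person_details where score > ?",
        conn,
        params=(3000,),
    )

print(high)
```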