# Intro to pandas and Seaborn

In [None]:
# Import libraries
import pandas as pd
import seaborn as sns

# 1. Pandas
pandas is an open-source data analysis and manipulation library for Python.

**References:**

[Pandas documentation](https://pandas.pydata.org/docs/). Dry and technical, but authoritative.

## Series
Series and DataFrames are the fundamental building blocks (data structures) in pandas.

A Series is a one-dimensional array of values, like a single column in a table, where each value has an index label.

In [None]:
# Make a series from a list
ser = pd.Series([5, 10, 7])

In [None]:
ser
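To make the role of the index more concrete, here is a minimal sketch. The labels `"a"`, `"b"`, `"c"` and the variable names are made up for illustration; they are not part of the workshop data.

```python
import pandas as pd

# A Series built from a list gets a default integer index: 0, 1, 2
ser_default = pd.Series([5, 10, 7])

# The labels "a", "b", "c" are hypothetical, chosen only for illustration
ser_labelled = pd.Series([5, 10, 7], index=["a", "b", "c"])

print(ser_default.index.tolist())  # the default index labels
print(ser_labelled["b"])           # look up a value by its label
print(ser_labelled.iloc[0])        # look up a value by its position
```

Label-based access (`ser_labelled["b"]`) and position-based access (`.iloc[0]`) are separate mechanisms, which matters once the index is no longer a simple 0, 1, 2 range.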

**Task 1:** Add a number to the series.

**Task 2:** Make one of the numbers a float (e.g., 7.5). Then run `ser.info()`. What does it do to the data type of the series?

## DataFrame
A DataFrame is a table of rows and columns; each value sits at the intersection of one row and one column. Here we create a dataframe from a dictionary: the keys become the column names, and each value is a list holding that column's entries.

In [None]:
# Make a dataframe from a dictionary and lists
df1 = pd.DataFrame({
    'Student': ['Bart', 'Lisa', 'Milhouse', 'Sideshow Bob'],
    'Score': [5, 10, 7.5, 8.5],
    'Hair colour': ['Yellow', 'Yellow', 'Blue', 'Red'],
})

In [None]:
df1
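As a rough illustration of how the two data structures relate, the sketch below uses a small made-up table (`df_demo`, not the workshop dataframe) to show that selecting a single column from a dataframe gives back a Series.

```python
import pandas as pd

# Small hypothetical table, not the workshop data
df_demo = pd.DataFrame({
    "Student": ["Bart", "Lisa"],
    "Score": [5, 10],
})

score_column = df_demo["Score"]     # selecting one column returns a Series
print(type(score_column).__name__)  # name of the returned type
print(df_demo.columns.tolist())     # column names come from the dictionary keys
print(df_demo.shape)                # (number of rows, number of columns)
```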

**Task 1:** Add a student (after Sideshow Bob) and a score to the lists for 'Student' and 'Score'.

**Task 2:** Add a new column to the dataframe, titled 'Hair colour' (yellow, yellow, blue and red).
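The tasks above can be done by editing the dictionary directly. As an alternative, the sketch below shows one possible way to add a column and a row to a dataframe that already exists. The dataframe `df_demo` and the student "Nelson" are invented for this example.

```python
import pandas as pd

df_demo = pd.DataFrame({"Student": ["Bart", "Lisa"], "Score": [5, 10]})

# Add a column: the list needs one value per existing row
df_demo["Hair colour"] = ["Yellow", "Yellow"]

# Add a row: build a one-row dataframe and concatenate it
# ("Nelson" and his values are made up for this sketch)
new_row = pd.DataFrame(
    {"Student": ["Nelson"], "Score": [6], "Hair colour": ["Brown"]}
)
df_demo = pd.concat([df_demo, new_row], ignore_index=True)

print(df_demo.shape)
```

`ignore_index=True` renumbers the rows so the new row continues the existing 0, 1, ... sequence instead of keeping its own index of 0.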

In [None]:
df1.info()

In [None]:
df1.shape

### Selecting rows

In [None]:
# Preview rows where 'Student' contains 'Bart' (the result is not stored)
df1[df1['Student'].str.contains('Bart')]

In [None]:
# Select rows based on conditions, copy to new dataframe
df1_name = df1[df1['Student'] == 'Lisa']

In [None]:
df1_name

In [None]:
# Select rows with query
df1_query = df1.query('Score < 8')

In [None]:
df1_query
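Both selection styles can also combine several conditions. The following is a minimal sketch, assuming a dataframe with the same columns as `df1`; it is rebuilt here as `df_demo` so the block runs on its own.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Student": ["Bart", "Lisa", "Milhouse", "Sideshow Bob"],
    "Score": [5, 10, 7.5, 8.5],
    "Hair colour": ["Yellow", "Yellow", "Blue", "Red"],
})

# Boolean mask: each condition in parentheses, joined with & (and) or | (or)
mask_result = df_demo[(df_demo["Score"] > 6) & (df_demo["Score"] < 9)]

# The same selection written with query()
query_result = df_demo.query("Score > 6 and Score < 9")

# Column names containing spaces need backticks inside query()
yellow = df_demo.query("`Hair colour` == 'Yellow'")

print(mask_result["Student"].tolist())
print(yellow["Student"].tolist())
```

Which style to use is largely a matter of taste: masks are plain Python expressions, while `query()` strings tend to be shorter for long conditions.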

**Task:** In a new dataframe (df_hair), select rows based on hair colour (the column you added above).

In [None]:
df_hair = df1[df1['Hair colour'].str.contains('Red')] # Selected using str.contains()

In [None]:
df_hair
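`str.contains()` matches substrings, which is not quite the same as testing for equality with `==`. The sketch below compares the two on a small made-up Series; the value "Dark red" is invented to show the difference.

```python
import pandas as pd

# "Dark red" is a made-up value added to show the difference
hair = pd.Series(["Yellow", "Yellow", "Blue", "Red", "Dark red"])

exact = hair == "Red"                            # True only for an exact match
partial = hair.str.contains("Red")               # substring match, case-sensitive
any_case = hair.str.contains("red", case=False)  # substring match, ignoring case

print(exact.tolist())
print(partial.tolist())
print(any_case.tolist())
```

Only the case-insensitive version picks up "Dark red", because the default substring match distinguishes "Red" from "red".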

## Read data from file
The functions `read_csv()` and `read_excel()` are the most commonly used ways to load data from files. pandas can read many other formats, but this session focuses on Excel and CSV.

In [None]:
# Load sample data
data_path = r'https://github.com/conceptbin/workshops/raw/main/pandas_data_analysis/data/Workbook%20(Working%20with%20Excel).xlsx'
df2 = pd.read_excel(data_path, sheet_name=0, header=1)

**Try:** Change `header` to 0 and see what happens to the dataframe.

In [None]:
df2
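The effect of the `header` argument can also be tried without downloading anything. This sketch uses `read_csv()` on a made-up piece of CSV text instead of the workshop's Excel file; `header` is a zero-based row number in both functions. The text and variable names are assumptions for illustration only.

```python
import io
import pandas as pd

# Made-up CSV text with a title line above the real column names
csv_text = "Sales report,\nMonth,sales\nJan,100\nFeb,150\n"

# header=1: skip the title line and take column names from the second line
with_header_1 = pd.read_csv(io.StringIO(csv_text), header=1)

# header=0: the title line itself is treated as the column names
with_header_0 = pd.read_csv(io.StringIO(csv_text), header=0)

print(with_header_1.columns.tolist())
print(with_header_0.columns.tolist())
```

With `header=0` the real column names end up as the first row of data, which is usually the sign that the header row number needs adjusting.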

Pandas offers various ways to summarise numerical data:

In [None]:
df2['sales (£)'].sum() # Add up the sales column

**Task:** Summarise the sales using other methods: `mean()`, `median()`, `min()`, `max()`.

In [None]:
# Copy the code in the previous cell and edit it.
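For reference, here is how these summary methods behave on a small made-up Series (the figures are invented, not taken from the workbook). The `agg()` call at the end is one possible way to run several summaries at once.

```python
import pandas as pd

sales = pd.Series([100, 150, 200, 50], name="sales")  # made-up figures

print(sales.mean())              # arithmetic average
print(sales.median())            # middle value when sorted
print(sales.min(), sales.max())  # smallest and largest value

# agg() runs several summary functions in one call
summary = sales.agg(["sum", "mean", "median", "min", "max"])
print(summary)
```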

The `describe()` function can generate descriptive statistics for a specific column or the entire dataframe.

In [None]:
df2['sales (£)'].describe().round(2)
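Calling `describe()` on an entire dataframe works the same way. The sketch below uses a made-up dataframe (`df_demo`) to show that only numeric columns are summarised by default, and that `include="all"` brings in text columns as well.

```python
import pandas as pd

# Made-up data for illustration
df_demo = pd.DataFrame({
    "Month": ["Jan", "Feb", "Mar", "Apr"],
    "sales": [100, 150, 200, 50],
})

numeric_summary = df_demo.describe()            # numeric columns only
full_summary = df_demo.describe(include="all")  # text columns included too

print(numeric_summary.columns.tolist())
print(full_summary.columns.tolist())
```

For text columns the summary reports counts, the number of unique values and the most frequent value, while the numeric statistics are left empty for them.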

# 2. Plotting data
Seaborn is a visualisation library for Python, built on top of Matplotlib.

[Seaborn User Guide and tutorial](https://seaborn.pydata.org/tutorial.html): Accessible introduction to Seaborn's functions.

[Data Visualization](https://www.kaggle.com/learn/data-visualization): Kaggle tutorial that goes into detail, with worked examples of Seaborn visualisations.

## Bar plots
Each of the two dataframes is visualised below as a simple bar plot using the `barplot()` function.

### Student and score

In [None]:
sns.barplot(data=df1, x="Student", y="Score", color="blue")

### Sales by month
Bar plot of sales by month.

In [None]:
sns.barplot(data=df2, x="Month", y="sales (£)")
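If a category appears on more than one row, `barplot()` by default draws each bar at the mean of that category's values and adds an error bar. The sketch below uses made-up data with two rows per month to show this, and computes the same means in pandas for comparison. Whether the workshop data has repeated months is not assumed here.

```python
import pandas as pd
import seaborn as sns

# Made-up data with two rows per month
df_demo = pd.DataFrame({
    "Month": ["Jan", "Jan", "Feb", "Feb"],
    "sales": [100, 200, 50, 150],
})

# barplot() draws one bar per month, at the mean of that month's rows
ax = sns.barplot(data=df_demo, x="Month", y="sales")

# The same bar heights, computed directly in pandas
monthly_mean = df_demo.groupby("Month", sort=False)["sales"].mean()
print(monthly_mean)
```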

The same bar plot, now with all bars in one colour:

In [None]:
sns.barplot(data=df2, x="Month", y="sales (£)", color="blue", saturation=0.4)

## Histograms
A histogram shows how the values in a column are distributed across the dataset. Here it is drawn with the `displot()` function.

In [None]:
# Histogram of the sales column:
sns.displot(df2, x="sales (£)")

In [None]:
# Same histogram, with the number of bins (value ranges) set to 10:
sns.displot(df2, x="sales (£)", bins=10)

In [None]:
# Same histogram again, with a gap between the bars
sns.displot(df2, x="sales (£)", bins=10, shrink=.8)
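To see what a bin count means in numbers, the sketch below splits a made-up set of values into equal-width bins with NumPy's `histogram()` function. It illustrates the general idea behind an integer `bins` argument rather than reproducing Seaborn's internals, and it uses 5 bins to keep the output short.

```python
import numpy as np

# Made-up sales figures
values = np.array([12, 15, 21, 22, 25, 30, 38, 41, 47, 50])

# bins=5 splits the range from min to max into 5 equal-width intervals
counts, edges = np.histogram(values, bins=5)

print(edges)   # 6 edges mark the boundaries of the 5 bins
print(counts)  # how many values fall into each bin
```

Each bar in a histogram corresponds to one entry in `counts`, so more bins give narrower bars and a more detailed, but also noisier, picture of the distribution.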