<a href="https://colab.research.google.com/github/Vincenzo-Miracula/Zayed-University/blob/main/LectureSession1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Python for Data Analysis

### Research Computing Services

Instructor: Vincenzo Miracula <br>
Website: [LinkedIn](https://www.linkedin.com/in/vincenzo-miracula/) <br>
Tutorial materials: [http://rcs.bu.edu/examples/python/DataAnalysis](http://rcs.bu.edu/examples/python/DataAnalysis)<br>
Contact us: vincenzo.miracula@phd.unict.it <br>

## Course Content
1. Python packages for data scientists
2. Data manipulation with Pandas
3. Basic data plotting
4. Descriptive statistics
5. Inferential statistics

## Python packages for data scientists
* [Pandas](https://pandas.pydata.org)
    - Adds data structures and tools designed to work with table-like data (similar to Vectors and Data Frames in R)
    - Provides tools for data manipulation: *reshaping*, *merging*, *sorting*, *slicing*, *aggregation*, etc.
    - Makes it easy to handle missing data
* [SciKit-Learn](https://scikit-learn.org/stable/)
    - Provides machine learning algorithms: classification, regression, clustering, model validation, etc.
    - Built on NumPy, SciPy, and matplotlib
   
### Visualization

* [matplotlib](https://matplotlib.org/)
    - Python 2-D plotting library for publication-quality figures in a variety of hardcopy formats
    - Functionalities similar to MATLAB
    - Line plots, scatter plots, bar charts, histograms, pie charts, etc.
    - Effort needed to create advanced visualizations
* [seaborn](https://seaborn.pydata.org/)
    - Based on matplotlib
    - Provides a high-level interface for drawing attractive statistical graphs
    - Similar to the ggplot2 library in R

## Loading Python libraries

In [2]:
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Pandas
The main focus of this tutorial is using the Pandas library to manipulate and analyze data.

Pandas is a Python package that deals mostly with:
- **Series**  (1-D homogeneous array)
- **DataFrame** (2-D labeled heterogeneous array)
- **MultiIndex** (for hierarchical data)
- **Xarray** (a separate package, built on top of Pandas, for n-D arrays)

The Pandas content of this tutorial will cover:
* Creating and understanding Series and DataFrames
* Importing/Reading data
* Data selection and filtering
* Data manipulation via sorting, grouping, and rearranging
* Handling missing data


In addition, we will cover the following topics:
* Basic data plotting
* Descriptive statistics (time permitting)
* Inferential statistics (time permitting)

### Pandas Series

A Pandas *Series* is a 1-dimensional labeled array containing data of the same type (integers, strings, floating point numbers, Python objects, etc.). It is a generalized NumPy array with an explicit axis called the *index*.

In [None]:
# Example of creating Pandas series:
# Order all S1 together
s1 = pd.Series([-3, -1, 1, 3, 5])
print(s1)

In [None]:
s1[:2] # First 2 elements

In [None]:
print(s1[[2,1,0]])  # Elements out of order

In [None]:
type(s1)

In [None]:
# Creating Pandas series with index:
rng = np.random.default_rng()
s2 = pd.Series(rng.normal(size=5), index=['a', 'b', 'c', 'd', 'e'])
print(s2)

In [None]:
# Create a Series from dictionary
data = {'pi': 3.14159, 'e': 2.71828}  # dictionary
print(data)
s3 = pd.Series(data)
print(s3)

In [None]:
# Create a new series from a dictionary and reorder the elements
s4 = pd.Series(data, index=['e', 'pi', 'tau'])
print(s4)

In [None]:
# Series can be treated as a 1-D array and you can apply functions to them:
print("Median:", s4.median())
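
The median above is computed even though `s4` has no value for `'tau'`. The sketch below rebuilds a series like `s4` under the placeholder name `s` (an illustration name, not part of the tutorial) to show how missing values behave in reductions and in element-wise operations.

```python
import numpy as np
import pandas as pd

# Illustrative series with one missing value, mirroring s4 above
s = pd.Series({'e': 2.71828, 'pi': 3.14159, 'tau': np.nan})

# Reductions skip NaN by default
median_skipping_nan = s.median()
# With skipna=False the missing value propagates to the result
median_with_nan = s.median(skipna=False)
print(median_skipping_nan, median_with_nan)

# Arithmetic and NumPy functions are applied element by element
# and keep the index labels
doubled = s * 2
logs = np.log(s)
print(doubled)
print(logs)
```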

### Attributes and Methods:
An attribute is a variable stored in the object, e.g., index or size with Series.
A method is a function stored in the object, e.g., head() or median() with Series.

|  Attribute/Method | Description |
|-----|-----|
| dtype | data type of values in series |
| empty | True if series is empty |
| size | number of elements |
| values | Returns values as ndarray |
| head() | First n elements |
| tail() | Last n elements |

Execute *dir(s1)* to see all attributes and methods.

I also recommend the online documentation, which is easier to read and includes examples.

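As a rough illustration of the difference between attributes and methods, the sketch below uses a small made-up series. The name `s` and its values are placeholders chosen for this example.

```python
import pandas as pd

# Made-up series used only for illustration
s = pd.Series([-3, -1, 1, 3, 5])

# Attributes are read without parentheses
print(s.dtype)    # data type of the values
print(s.size)     # number of elements
print(s.empty)    # False here, because the series has elements
print(s.values)   # the underlying NumPy array

# Methods are called with parentheses and may take arguments
first_two = s.head(2)
last_two = s.tail(2)
print(first_two)
print(last_two)
```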


In [None]:
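# For more information on a particular method or attribute use the help() function
# Pass the method itself (no parentheses); help(s4.head()) would describe the returned Series instead
help(s4.head)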

In [None]:
# You can also add a question mark to get help information
s4.head?

In [None]:
s4.index?

One final way to get help is to press shift-tab when you are in the parentheses of a method or after an attribute. Try this in the exercise below.

### Exercise - Create your own Series

In [None]:
# Create a series with 10 elements containing both positive and negative integers
# Examine the series with the head() method
# Create a new series from the originally created series with only negative numbers
# <your code goes here >
# mySeries = pd.Series(  ...  )

### Pandas DataFrames

A Pandas *DataFrame* is a 2-dimensional, size-mutable, heterogeneous tabular data structure with labeled rows and columns. You can think of it as a dictionary-like container that stores Pandas Series objects, one per column.

In [None]:
df = pd.DataFrame({'Name': pd.Series(['Alice', 'Bob', 'Chris']),
                  'Age': pd.Series([21, 25, 23])})
print(df)

In [None]:
df2 = pd.DataFrame(np.array([['Alice','Bob','Chris'], [21, 25, 23]]).T, columns=['Name','Age'])
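
One caveat about building a DataFrame from a single NumPy array: an array has only one data type, so mixing names and numbers stores every value as a string. The sketch below shows this and one possible fix. The names `arr` and `people` are placeholders chosen for this example.

```python
import numpy as np
import pandas as pd

# Mixing strings and integers in one array turns every element into a string
arr = np.array([['Alice', 'Bob', 'Chris'], [21, 25, 23]]).T
people = pd.DataFrame(arr, columns=['Name', 'Age'])
print(arr.dtype)       # a NumPy unicode string type
print(people.dtypes)   # Age is not numeric at this point

# Convert Age to integers so that numeric methods such as mean() work
people['Age'] = people['Age'].astype(int)
print(people.dtypes)
print(people['Age'].mean())
```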

In [None]:
# Use the head() method to print the first 5 records in the dataframe (same as with series)
df2.head()

In [None]:
# Add a new column to df2:
df2['Height'] = pd.Series([5.2, 6.0, 5.6])
df2.head()

In [None]:
df2

### Reading data using Pandas
You can read CSV (comma separated values) files using Pandas. The command shown below reads a CSV file into the Pandas dataframe df.

In [1]:
import pandas as pd

In [2]:
# Read a csv file into Pandas Dataframe
df = pd.read_csv("http://rcs.bu.edu/examples/python/DataAnalysis/Salaries.csv")

In [None]:
# Display the first 10 records
df.head(10)

In [None]:
# Display structure of the data frame
df.info()

### More details on DataFrame data types

|Pandas Type | Native Python Type | Description |
|------------|--------------------|-------------|
| object | string | The most general dtype. Assigned to a column that holds strings or mixed types (numbers and strings). |
| int64  | int | Integer values. 64 is the number of bits used to store each value. |
| float64 | float | Floating-point numbers. A column that contains numbers and NaNs defaults to float64, because NaN is itself a floating-point value. |
| datetime64\[ns\], timedelta64\[ns\] | N/A (but see the datetime module in Python’s standard library) | Values meant to hold time data. Look into these for time series experiments. |

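The sketch below shows how pandas picks a dtype for a few small, made-up series. The variable names are placeholders, and the exact dtype labels that get printed (for example the time resolution of the datetime type) may differ between pandas versions.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])              # whole numbers only
with_nan = pd.Series([1, 2, np.nan])     # a missing value forces a float dtype
mixed = pd.Series([1, 'two', 3.0])       # mixed types fall back to object
dates = pd.to_datetime(pd.Series(['2024-01-01', '2024-02-01']))

for name, series in [('ints', ints), ('with_nan', with_nan),
                     ('mixed', mixed), ('dates', dates)]:
    print(name, series.dtype)
```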

### DataFrame attributes
|df.attribute | Description |
|-------------|-------------|
| dtypes | list the types of the columns |
| columns | list the column names |
| axes | list the row labels and column names |
| ndim | number of dimensions |
| size | number of elements |
| shape | return a tuple representing the dimensionality |
| values | numpy representation of the data |

### Dataframe methods
|df.method() | Description |
|-------------|-------------|
| head(\[n\]), tail(\[n\]) | first/last n rows |
| describe() | generate descriptive statistics (for numeric columns only) |
| max(), min() | return max/min values for all numeric columns |
| mean(), median() | return mean/median values for all numeric columns |
| std() | standard deviation |
| sample(\[n\]) | returns a random sample of n elements from the data frame |
| dropna() | drop all the records with missing values |

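Since `dropna()` appears in the table, here is a minimal sketch of handling missing data on a made-up table. The DataFrame `toy` and its values are hypothetical. `isna()` and `fillna()` are further pandas methods that the table does not list.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with two missing values
toy = pd.DataFrame({'name': ['Alice', 'Bob', 'Chris', 'Dana'],
                    'age': [21, np.nan, 23, 30],
                    'height': [5.2, 6.0, np.nan, 5.9]})

# Count the missing values in each column
print(toy.isna().sum())

# Keep only the rows that have no missing values
complete = toy.dropna()
print(complete)

# Alternatively, replace missing ages with the mean age (height is left untouched)
filled = toy.fillna({'age': toy['age'].mean()})
print(filled)

# describe() ignores missing values when computing its statistics
print(toy.describe())
```

Whether to drop or fill depends on the analysis. Dropping is simpler but discards rows, while filling keeps the rows at the cost of introducing estimated values.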
Sometimes the column names in the input file are too long or contain special characters. In such cases we rename them to make it easier to work with these columns.

In [None]:
# Let's create a copy of this dataframe with new column names
# If we do not want to create a new data frame, we can pass the inplace=True argument
df_new = df.rename(columns={'sex': 'gender', 'phd': 'yearsAfterPhD', 'service': 'yearsOfService'})
df_new.head()

### DataFrame Exploration

In [None]:
# Identify the type of df_new object
type(df_new)

In [None]:
# Check the data type of the column "salary"
# We access columns using the brackets, e.g., df['column_name']
df_new['salary'].dtype

In [None]:
# If the column name has no spaces, complex symbols, and is not the name of an attribute/method
# you can use the syntax df.column_name
df_new.salary.dtype

In [None]:
# List the types of all columns
df_new.dtypes

In [None]:
# List the column names
df_new.columns

In [None]:
# List the row labels and the column names
df_new.axes

In [None]:
# Number of rows and columns
df_new.shape

In [None]:
# Total number of elements in the Data Frame (78 x 6)
df_new.size

In [None]:
# Output some descriptive statistics for the numeric columns
df_new.describe()

In [None]:
# Remember we can use the ? to get help about the function
df_new.describe?

In [None]:
# Create a new column using the assign method
df_new = df_new.assign(salary_k=lambda x: x.salary/1000.0)
df_new.head(10)

In [None]:
# List the unique values in a column (use nunique() to count them)
# rank is also the name of a DataFrame method, so we access the column using df_new['rank']
df_new['rank'].unique()

In [None]:
# Get the frequency table for a categorical or binary column
df_new['rank'].value_counts()

In [None]:
# Get a proportion table (normalize=True divides each count by the total)
df_new['rank'].value_counts(normalize=True)

### Data slicing and grouping

In [None]:
#Extract a column by name
df_new['gender'].head()

In [None]:
# Calculate median number of service years
df_new['yearsOfService'].median()

### Grouping data

In [None]:
# Group data using rank
df_rank = df_new.groupby('rank')
df_rank.head()

In [None]:
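# Calculate mean of all numeric columns for the grouped object
# numeric_only=True skips the text columns, which raise a TypeError in pandas 2.0 and later
df_rank.mean(numeric_only=True)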

In [None]:
# Most of the time, the "grouping" object is not stored, but is used as a step in getting a summary:
df_new.groupby('gender').mean(numeric_only=True)

In [None]:
# Calculate the mean salary for men and women. The following produce Pandas Series (single brackets around salary)
df_new.groupby('gender')['salary'].mean()

In [None]:
# If we use double brackets Pandas will produce a DataFrame
df_new.groupby('gender')[['salary']].mean()

In [None]:
# Group using 2 variables - gender and rank:
df_new.groupby(['rank','gender'], sort=True)[['salary']].mean()

### Filtering

In [None]:
# Select observations with a value in the salary column > 120K
df_filter = df_new[df_new.salary > 120000]
df_filter.head()

In [None]:
# Select data for female professors
df_w = df_new[df_new.gender == 'Female']
df_w.head()
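
Filters like the two above can be combined. The sketch below uses a made-up table (`staff` and its values are hypothetical) to show the element-wise operators `&` (and), `|` (or) and `~` (not).

```python
import pandas as pd

# Hypothetical toy table, not the Salaries dataset
staff = pd.DataFrame({'rank': ['Prof', 'AsstProf', 'Prof', 'AssocProf'],
                      'gender': ['Female', 'Male', 'Male', 'Female'],
                      'salary': [130000, 80000, 110000, 125000]})

# Each condition needs its own parentheses because & and | bind tighter than ==
high_paid_women = staff[(staff['gender'] == 'Female') & (staff['salary'] > 120000)]
prof_or_high = staff[(staff['rank'] == 'Prof') | (staff['salary'] > 120000)]
not_prof = staff[~(staff['rank'] == 'Prof')]

print(high_paid_women)
print(prof_or_high)
print(not_prof)
```

Python's `and`, `or` and `not` keywords do not work here, because they expect a single True/False value rather than a whole column of them.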

### Slicing a dataframe

In [None]:
# Select column salary
salary = df_new['salary']

In [None]:
# Check data type of the result
type(salary)

In [None]:
# Look at the first few elements of the output
salary.head()

In [None]:
# Select column salary and make the output to be a data frame
df_salary = df_new[['salary']]

In [None]:
# Check the type
type(df_salary)

In [None]:
# Select a subset of rows (based on their position):
# Note 1: The location of the first row is 0
# Note 2: The last value in the range is not included
df_new[0:10]  # rows at positions 0 through 9

In [None]:
# If we want to select both rows and columns we can use .loc
# .loc selects by label, and both ends of the slice (10 and 20) are included
df_new.loc[10:20, ['rank', 'gender','salary']]

In [None]:
# Unlike .loc, .iloc selects rows (and columns) by absolute position; the end of the slice is excluded:
# iloc = integer location
df_filter.iloc[10:20, [0,3,4,5]]
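
The difference between `.loc` and `.iloc` is easiest to see on a filtered DataFrame, where the row labels no longer match the row positions. The sketch below uses a made-up table (`toy` and its values are hypothetical).

```python
import pandas as pd

# Hypothetical toy table with default row labels 0..5
toy = pd.DataFrame({'rank': ['Prof', 'Prof', 'AsstProf', 'Prof', 'AssocProf', 'Prof'],
                    'salary': [90, 120, 95, 130, 110, 150]})

# Filtering keeps the original labels: the remaining rows are labeled 1, 3, 4, 5
subset = toy[toy['salary'] > 100]

# .loc uses labels and includes both ends of the slice: labels 1, 3 and 4
by_label = subset.loc[1:4, ['rank', 'salary']]

# .iloc uses positions and excludes the end: positions 1 and 2, that is labels 3 and 4
by_position = subset.iloc[1:3, [1]]

print(by_label)
print(by_position)
```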

### Common Aggregation Functions:

The following functions are commonly used to aggregate data.

|Function|Description|
|-------|--------|
|min   | minimum |
|max   | maximum |
|count   | number of non-null observations |
|sum   | sum of values |
|mean  | arithmetic mean of values |
|median | median |
|mad | mean absolute deviation (removed in pandas 2.0) |
|mode | mode |
|prod   | product of values |
|std  | standard deviation |
|var | unbiased variance |

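Several of these functions can be applied in one step with the `agg()` method of a grouped object. The sketch below uses a made-up table (`toy` and its values are hypothetical). Because `mad()` is not available in recent pandas versions, the mean absolute deviation is computed by hand here.

```python
import pandas as pd

# Hypothetical toy table, not the Salaries dataset
toy = pd.DataFrame({'rank': ['Prof', 'Prof', 'AsstProf', 'AsstProf', 'AsstProf'],
                    'salary': [120, 140, 80, 90, 85]})

# One row per group, one column per aggregation function
summary = toy.groupby('rank')['salary'].agg(['count', 'min', 'max', 'mean', 'median', 'std'])
print(summary)

# Mean absolute deviation: the average distance of each value from its group mean
mean_abs_dev = toy.groupby('rank')['salary'].apply(lambda s: (s - s.mean()).abs().mean())
print(mean_abs_dev)
```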


## Exploring data using graphics

### Graphics with the Salaries dataset

In [None]:
# Use matplotlib to draw a histogram of the salary data
plt.hist(df_new['salary'], bins=8, density=True)

In [None]:
# Use seaborn package to draw a histogram
sns.displot(df_new['salary'])

In [None]:
# Use the pandas plot() method (built on matplotlib) to display a barplot
df_new.groupby(['rank'])['salary'].count().plot(kind='bar')

In [None]:
# Use seaborn package to display a barplot
sns.set_style("whitegrid")
ax = sns.barplot(x='rank',y ='salary', data=df_new, estimator=len)

In [None]:
# Split into 2 groups:
ax = sns.barplot(x='rank',y ='salary', hue='gender', data=df_new, estimator=len)

In [None]:
# Violinplot
sns.violinplot(x = "salary", data=df_new)

In [None]:
# Scatterplot in seaborn
sns.jointplot(x='yearsOfService', y='salary', data=df_new)

In [None]:
# Box plot
sns.boxplot(x='rank',y='salary', data=df_new)

In [None]:
# Side-by-side box plot
sns.boxplot(x='rank', y='salary', data=df_new, hue='gender')

In [None]:
# Pairplot
sns.pairplot(df_new)

## Descriptive statistics
Descriptive statistics are statistics used to describe data. We have already seen some of them in the output of the DataFrame describe() method.

Descriptive statistics summarize attributes of a sample, such as the min/max values, and the mean (average) of the data. Below is a summary of some additional methods that calculate descriptive statistics.

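|Function|Description|
|-------|--------|
|min   | minimum |
|max   | maximum |
|mean  | arithmetic mean of values |
|median | median |
|mad | mean absolute deviation (removed in pandas 2.0) |
|mode | mode |
|std  | standard deviation |
|var | unbiased variance |
|sem | standard error of the mean |
|skew| sample skewness |
|kurt| kurtosis (Fisher's definition, where a normal distribution gives 0) |
|quantile| value at a given quantile (e.g., 0.25 for the 25th percentile) |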
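
A minimal sketch of these methods on a small made-up series follows. The name `x` and its values are placeholders chosen for this example.

```python
import pandas as pd

# Made-up sample used only for illustration
x = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

print("mean:    ", x.mean())
print("median:  ", x.median())
print("mode:    ", x.mode().tolist())   # mode() returns a Series, since ties are possible
print("var:     ", x.var())             # sample variance (divides by n - 1)
print("std:     ", x.std())
print("sem:     ", x.sem())             # std divided by the square root of n
print("skew:    ", x.skew())
print("kurt:    ", x.kurt())

# quantile() accepts a single fraction or a list of fractions between 0 and 1
quartiles = x.quantile([0.25, 0.5, 0.75])
print(quartiles)
```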
