Libraries:
- numpy
- pandas
- lets-plot
- scikit-learn
- statsmodels
- seaborn (used below only to load the iris dataset)

In [1]:
import pandas as pd
import seaborn as sns

---

# 1. Basic Python and Pandas

In [2]:
# Load the iris dataset into a pandas DataFrame
iris = sns.load_dataset('iris')

# Display the first 5 rows
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


(a) Finding the size and type of a variable

**Type**: What kind of data it holds.
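- `.dtype` **attribute**: For a pandas Series. A DataFrame uses `.dtypes` (plural), which returns the data type of each column.
- `type()` **function**: For a standard Python variable.
- `object` type: How pandas represents text data here (displayed as `dtype('O')`).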

**Size**: The dimensions of the data. For a DataFrame, this usually means the number of rows and columns.
- `.shape` **attribute**: Returns the dimensions, a tuple in the format `(rows, columns)`.
- `.size` **attribute**: Returns the total number of elements in the object (i.e., rows × columns).
- `len()` **function**: Returns the number of rows.

In [20]:
# Get the data type of a single column (a pandas series)
display(iris['species'].dtype)

# Get the data type of all columns in the DataFrame
display(iris.dtypes)

dtype('O')

Unnamed: 0,0
sepal_length,float64
sepal_width,float64
petal_length,float64
petal_width,float64
species,object


In [24]:
# Get the dimensions (rows, columns) of the DataFrame
display(iris.shape)

# Get the total number of elements in the DataFrame
display(iris.size)

# Get the total number of rows
display(len(iris))

(150, 5)

750

150
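
As a small complement to the cells above, the sketch below contrasts Python's built-in `type()` with the pandas `.dtype` attribute, and checks how `.shape`, `.size`, and `len()` relate. It uses a tiny hand-built DataFrame (`toy`, a hypothetical name with made-up columns) rather than the iris data, so it runs without downloading anything.

```python
import pandas as pd

# Hypothetical toy data standing in for iris
toy = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7],
    "n_petals": [5, 5, 5],
})

# type() reports the Python class of the object itself
container_type = type(toy)                 # the DataFrame class
column_type = type(toy["sepal_length"])    # the Series class

# .dtype reports what kind of values a Series holds
length_dtype = toy["sepal_length"].dtype
count_dtype = toy["n_petals"].dtype

# .size equals rows x columns, and len() equals the number of rows
n_rows, n_cols = toy.shape
print(container_type.__name__, column_type.__name__)
print(length_dtype, count_dtype)
print(n_rows, n_cols, toy.size, len(toy))
```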

(b) Subsetting variables by row and column

- **Subsetting**: Process of selecting specific rows or columns from the data.

For columns:
- `.columns` **attribute**: Returns the column labels of a DataFrame as an Index.
- `df['column_name']`: Selects a single column. This will return a Pandas Series.
- `df[['col1', 'col2', 'col3']]`: Selects multiple columns. This returns a new DataFrame.

For rows or a combination of rows and columns:
- `.loc[row_labels, column_labels]` **indexer** (**Label-based indexing**): Selects data by row and column labels (names or index values). Slices include the end label.
- `.iloc[row_positions, column_positions]` **indexer** (**Integer-based indexing**): Selects data by integer position (from 0 to length-1). Slices exclude the end position.

In [6]:
iris.columns

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')

In [22]:
# Select just the species column
display(iris['species'])

# Select the sepal_length and sepal_width columns
display(iris[['sepal_length', 'sepal_width']])

Unnamed: 0,species
0,setosa
1,setosa
2,setosa
3,setosa
4,setosa
...,...
145,virginica
146,virginica
147,virginica
148,virginica


Unnamed: 0,sepal_length,sepal_width
0,5.1,3.5
1,4.9,3.0
2,4.7,3.2
3,4.6,3.1
4,5.0,3.6
...,...,...
145,6.7,3.0
146,6.3,2.5
147,6.5,3.0
148,6.2,3.4


In [23]:
# --- Using .loc ---
# Select the row with index label 3
display(iris.loc[3])

# Select rows with index labels 1 through 4 (inclusive)
display(iris.loc[1:4])

# Select rows 0 and 5, for columns sepal_width and species
display(iris.loc[[0, 5], ['sepal_width', 'species']])

Unnamed: 0,3
sepal_length,4.6
sepal_width,3.1
petal_length,1.5
petal_width,0.2
species,setosa


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


Unnamed: 0,sepal_width,species
0,3.5,setosa
5,3.9,setosa


In [26]:
# --- Using .iloc ---
# Select the first row (position 0)
display(iris.iloc[0])

# Select rows at positions 1 through 4 (the end position, 5, is excluded)
display(iris.iloc[1:5])

# Select the first 3 rows (0, 1, 2) and the first 2 columns (0, 1)
display(iris.iloc[0:3, 0:2])

Unnamed: 0,0
sepal_length,5.1
sepal_width,3.5
petal_length,1.4
petal_width,0.2
species,setosa


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


Unnamed: 0,sepal_length,sepal_width
0,5.1,3.5
1,4.9,3.0
2,4.7,3.2
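
In iris the index labels happen to equal the row positions, so `.loc` and `.iloc` look almost interchangeable. The sketch below, which assumes a hypothetical `toy` DataFrame with a non-default index, may make the difference easier to see.

```python
import pandas as pd

# Hypothetical toy data with a non-default integer index
toy = pd.DataFrame(
    {"sepal_width": [3.5, 3.0, 3.2, 3.1]},
    index=[10, 20, 30, 40],
)

# .loc uses index labels, and the slice includes the end label
by_label = toy.loc[10:30]

# .iloc uses integer positions, and the slice excludes the end position
by_position = toy.iloc[0:2]

print(by_label.index.tolist())
print(by_position.index.tolist())
```

Here `toy.loc[10:30]` keeps three rows (labels 10, 20, 30), while `toy.iloc[0:2]` keeps two (positions 0 and 1).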


(c) Logical Indexing

**Logical Indexing (Boolean Masking)**: Selecting rows based on a True/False condition. Indexing the DataFrame with the mask returns only the rows where the mask is True.
- **Ampersand** `&`: For AND (both conditions must be true).
- **Pipe** `|`: For OR (at least one condition must be true).
- Combining conditions: Wrap each individual condition in parentheses `()`, because `&` and `|` bind more tightly than comparisons such as `==` and `<`.

In [45]:
display(iris['species'])

# 1. Create conditions (mask)
condition1 = iris['species'] == 'versicolor'
condition2 = iris['sepal_width'] < 2.5

# 2. Apply both masks to the DataFrame using '&'
narrow_versicolor = iris[condition1 & condition2]
display(narrow_versicolor)

# The same pattern in one line, with both conditions written inline
large_virginica = iris[(iris['species'] == 'virginica') & (iris['sepal_width'] > 2.5)]
display(large_virginica)

Unnamed: 0,species
0,setosa
1,setosa
2,setosa
3,setosa
4,setosa
...,...
145,virginica
146,virginica
147,virginica
148,virginica


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
53,5.5,2.3,4.0,1.3,versicolor
57,4.9,2.4,3.3,1.0,versicolor
60,5.0,2.0,3.5,1.0,versicolor
62,6.0,2.2,4.0,1.0,versicolor
68,6.2,2.2,4.5,1.5,versicolor
80,5.5,2.4,3.8,1.1,versicolor
81,5.5,2.4,3.7,1.0,versicolor
87,6.3,2.3,4.4,1.3,versicolor
93,5.0,2.3,3.3,1.0,versicolor


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
100,6.3,3.3,6.0,2.5,virginica
101,5.8,2.7,5.1,1.9,virginica
102,7.1,3.0,5.9,2.1,virginica
103,6.3,2.9,5.6,1.8,virginica
104,6.5,3.0,5.8,2.2,virginica
105,7.6,3.0,6.6,2.1,virginica
107,7.3,2.9,6.3,1.8,virginica
109,7.2,3.6,6.1,2.5,virginica
110,6.5,3.2,5.1,2.0,virginica
111,6.4,2.7,5.3,1.9,virginica
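
The cell above only exercises `&`. As a minimal sketch of the `|` (OR) case, assuming a small hypothetical `toy` DataFrame in place of iris:

```python
import pandas as pd

# Hypothetical toy data standing in for iris
toy = pd.DataFrame({
    "species": ["setosa", "versicolor", "virginica", "versicolor"],
    "sepal_width": [3.5, 2.3, 3.0, 2.8],
})

is_setosa = toy["species"] == "setosa"
is_narrow = toy["sepal_width"] < 2.5

# OR: keep rows where at least one condition is True
either = toy[is_setosa | is_narrow]

# AND: keep rows where both conditions are True (none in this toy data)
both = toy[is_setosa & is_narrow]

print(either.index.tolist())
print(len(both))
```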


(d) Categorical variables

**Categorical variables**: Variables that represent distinct groups or categories, with a limited, fixed number of possible values.
- **Object** vs **category**: Pandas reads text-based columns as the `object` data type. For categorical data, convert the column to the `category` data type. This saves memory, can improve performance, and enables category-specific plots and statistics.
- `.astype()` **method**: Converts a column's data type.
- `.cat.categories` **attribute**: Returns the category names of a categorical column as an Index. `.cat` is a special accessor that gives access to category-specific attributes and methods.

In [51]:
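# Check the original data type of the species column
# (dtype('O') on a fresh run; the output below comes from re-running this cell after the conversion)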
display(iris['species'].dtype)

# Convert it to the category data type
iris['species'] = iris['species'].astype('category')

# Check new data type
display(iris['species'].dtype)

# Access categories
display(iris['species'].cat.categories)

CategoricalDtype(categories=['setosa', 'versicolor', 'virginica'], ordered=False, categories_dtype=object)

CategoricalDtype(categories=['setosa', 'versicolor', 'virginica'], ordered=False, categories_dtype=object)

Index(['setosa', 'versicolor', 'virginica'], dtype='object')
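
The memory claim above can be checked directly. This is a rough sketch on a hypothetical column of repeated species names. Exact byte counts depend on the pandas version and platform, so only the comparison between the two numbers is meaningful.

```python
import pandas as pd

# Hypothetical column: three species names repeated many times
labels = pd.Series(["setosa", "versicolor", "virginica"] * 1000, dtype="object")
as_category = labels.astype("category")

# deep=True also counts the memory used by the Python strings themselves
object_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(object_bytes, category_bytes)
print(as_category.cat.categories.tolist())
```

The saving comes from storing each distinct name once and keeping a small integer code per row. It shrinks as the number of distinct values approaches the number of rows.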

(e) Checking for or removing missing (NA) values
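
This section is still empty in the notes. As a hedged starting point, the sketch below shows two common pandas calls for this task, `isna()` and `dropna()`, on a hypothetical `toy` DataFrame with missing values inserted by hand (iris itself has none to demonstrate with).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "sepal_length": [5.1, np.nan, 4.7],
    "species": ["setosa", "setosa", None],
})

# isna() marks each missing cell; sum() then counts them per column
missing_per_column = toy.isna().sum()

# dropna() returns a new DataFrame without rows containing any missing value
complete_rows = toy.dropna()

print(missing_per_column.to_dict())
print(len(toy), len(complete_rows))
```

`dropna()` does not modify `toy`; assign the result to keep it.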

---

# 2. Exploratory Data Analysis

(a) Summary statistics and the five-number summary

i. Constructing boxplots
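
The five-number summary (minimum, first quartile, median, third quartile, maximum) is what a boxplot draws. A minimal sketch with pandas, on made-up values, is below. It assumes pandas' default linear interpolation for quantiles; other quartile conventions can give slightly different Q1 and Q3.

```python
import pandas as pd

# Made-up values, chosen so the quartiles fall on data points
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])

# Quantiles at 0, 25, 50, 75 and 100 percent give min, Q1, median, Q3, max
five_number = values.quantile([0, 0.25, 0.5, 0.75, 1.0])

# describe() reports the same five values plus count, mean and std
summary = values.describe()

print(five_number.tolist())
print(summary[["min", "25%", "50%", "75%", "max"]].tolist())
```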

(b) Plotting one variable; two variables

i. Scatterplots, histograms, box plots, bar graphs

ii. When to use which plot

iii. Constructing boxplots

(c) Identifying outliers

(d) Components of a ggplot (Note: seaborn and Pandas plotting not included)

i. Geometries, aesthetics, facets, stats, scales, coords.

---

# 3. Data Transformation

(a) Data transformation verbs

i. query, filter, sort_values, assign, agg, groupby, nlargest, take
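
As a hedged illustration of several of these verbs (not all of them), the sketch below applies `query`, `assign`, `sort_values`, `groupby` with `agg`, and `nlargest` to a hypothetical `toy` DataFrame. The column names and values are made up for the example.

```python
import pandas as pd

# Hypothetical toy data standing in for iris
toy = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor", "versicolor", "virginica"],
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3],
})

# query: keep rows matching a condition written as a string
long_sepals = toy.query("sepal_length > 5.0")

# assign: add a derived column; sort_values: order rows by it
with_ratio = (
    toy.assign(ratio=toy["sepal_length"] / toy["sepal_width"])
       .sort_values("ratio", ascending=False)
)

# groupby + agg: one summary row per species
mean_length = toy.groupby("species").agg(mean_length=("sepal_length", "mean"))

# nlargest: the n rows with the largest values in a column
top_two = toy.nlargest(2, "sepal_length")

print(long_sepals.index.tolist())
print(with_ratio.index.tolist())
print(mean_length["mean_length"].round(2).to_dict())
print(top_two.index.tolist())
```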

(b) The different joins

i. Inner, Left, Right, Outer
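
The four joins differ only in which keys they keep. A minimal sketch using `pd.merge` and its `how` argument, with two hypothetical tables (`left` and `right`) that share the keys `b` and `c`:

```python
import pandas as pd

# Hypothetical tables: "a" appears only on the left, "d" only on the right
left = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "right_val": [20, 30, 40]})

# The `how` argument chooses which keys survive the join
joined = {
    how: pd.merge(left, right, on="key", how=how)
    for how in ["inner", "left", "right", "outer"]
}

for how, result in joined.items():
    print(how, sorted(result["key"].tolist()))
```

Inner keeps keys present in both tables, left keeps all left keys, right keeps all right keys, and outer keeps every key. Cells with no match are filled with missing values.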

(c) Method chaining

(d) Tidy data

(e) Converting to tidy data

i. melt, pivot, str.split
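
A rough sketch of these three tools on a hypothetical wide table (`wide`, with made-up columns): `melt` reshapes wide to long, `str.split` breaks a text column into parts, and `pivot` reshapes long back to wide.

```python
import pandas as pd

# Hypothetical wide table: one column per measurement
wide = pd.DataFrame({
    "flower_id": [1, 2],
    "sepal_length": [5.1, 4.9],
    "sepal_width": [3.5, 3.0],
})

# melt: wide -> long, one row per (flower, measurement) pair
long = wide.melt(id_vars="flower_id", var_name="measurement", value_name="cm")

# str.split with expand=True breaks one text column into several
parts = long["measurement"].str.split("_", expand=True)
long = long.assign(part=parts[0], dimension=parts[1])

# pivot: long -> wide again, undoing the melt
back_to_wide = long.pivot(index="flower_id", columns="measurement", values="cm")

print(long.shape)
print(back_to_wide.shape)
```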

---

# 4. Linear Regression/Modeling (LM)

(a) The LM equation
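
One common way to write the simple (one-predictor) linear model is given below. The symbols are a conventional choice and an assumption here, not necessarily the notation used in the course.

$$
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \hat{y}_i = b_0 + b_1 x_i, \qquad e_i = y_i - \hat{y}_i
$$

Here $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\varepsilon_i$ is the error term. $b_0$ and $b_1$ are the estimates obtained from the data, $\hat{y}_i$ is the fitted value, and $e_i$ is the residual.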

(b) Terminology: model, dependent/independent variables, error, residuals, slope, intercept

(c) Interpretation of coefficients of the LM equation

(d) Conditions necessary to apply LM

(e) Assumptions of LM and checking assumptions

i. Histogram of residuals

ii. Scatterplot of residuals

(f) Checking for outliers using LM

(g) Interpreting the quality of a LM

i. $R^2$ value

ii. Adjusted $R^2$ value
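
For reference, the standard definitions are sketched below, assuming $n$ observations, $p$ independent variables (not counting the intercept), fitted values $\hat{y}_i$, and sample mean $\bar{y}$.

$$
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}, \qquad
R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
$$

$R^2$ never decreases when a variable is added to the model, whereas adjusted $R^2$ penalizes additional variables and can go down.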

(h) Prediction using LM

i. Confidence and prediction intervals

(i) Advantages and disadvantages of LM

(j) LM with multiple independent variables

i. Categorical independent variables

ii. Note: Transformed independent variables not included

(k) LM in statsmodels: the sm.formula.ols() function and fit(), get_prediction(), summary() and summary_frame() methods and their outputs

(l) LM in scikit-learn: the LinearRegression object, intercept_ and coef_ attributes, and fit(), predict() and score() methods and their outputs