###Exercise 1: Loading Data with Pandas

1. Objective: Learn how to load and inspect datasets using Pandas.

2. Steps:

Import the Pandas library and load a CSV file into a DataFrame.

In [8]:
import pandas as pd
df = pd.read_csv('heart_failure_dataset.csv')

Use the head(), tail(), and info() functions to inspect the dataset.

In [7]:
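# A notebook cell displays only its last expression, so head() and tail() show nothing here;
# run them in separate cells or wrap them in print() to see their output.
df.head()   # first five rows
df.tail()   # last five rows
df.info()   # index range, column names, non-null counts, dtypes, memory usage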

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       299 non-null    float64
 1   anaemia                   299 non-null    int64  
 2   creatinine_phosphokinase  299 non-null    int64  
 3   diabetes                  299 non-null    int64  
 4   ejection_fraction         299 non-null    int64  
 5   high_blood_pressure       299 non-null    int64  
 6   platelets                 299 non-null    float64
 7   serum_creatinine          299 non-null    float64
 8   serum_sodium              299 non-null    int64  
 9   sex                       299 non-null    int64  
 10  smoking                   299 non-null    int64  
 11  time                      299 non-null    int64  
 12  DEATH_EVENT               299 non-null    int64  
dtypes: float64(3), int64(10)
memory usage: 30.5 KB


Check for missing values and data types of each column using isnull() and dtypes.

In [3]:
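# Only the last expression of a cell is displayed, so just df.dtypes appears below.
df.isnull().sum()   # number of missing values per column
df.dtypes           # data type of each column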

age                         float64
anaemia                       int64
creatinine_phosphokinase      int64
diabetes                      int64
ejection_fraction             int64
high_blood_pressure           int64
platelets                   float64
serum_creatinine            float64
serum_sodium                  int64
sex                           int64
smoking                       int64
time                          int64
DEATH_EVENT                   int64
dtype: object

3. Questions:

How do you load a CSV file into a Pandas DataFrame?
You can use pd.read_csv('filename.csv') to load a CSV file into a DataFrame.

What information does the info() function provide about the dataset?
It shows the index range, each column's name, non-null count, and data type, plus the DataFrame's memory usage.

How can you identify missing values in the dataset?
Use df.isnull().sum() to see how many missing values exist in each column.
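
The three answers above can be tried without the original file. The following is a minimal sketch that reads a small, made-up CSV from an in-memory string; the rows are hypothetical and only the column names are borrowed from the dataset above.

```python
import io

import pandas as pd

# Hypothetical three-row sample (not real records); one value is left empty on purpose.
csv_text = """age,anaemia,serum_creatinine,DEATH_EVENT
75.0,0,1.9,1
55.0,0,,1
65.0,1,1.3,0
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(2))          # first two rows
sample.info()                  # prints column names, non-null counts, and dtypes
print(sample.isnull().sum())   # missing values per column
```

The empty field in the second row is read as NaN, so `isnull().sum()` reports one missing value for `serum_creatinine`.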

###Exercise 2: Handling Missing Data

1. Objective: Practice techniques for handling missing data in a dataset.

2. Steps:

Identify missing values in the dataset using isnull().sum().

In [4]:
df.isnull().sum()

age                         0
anaemia                     0
creatinine_phosphokinase    0
diabetes                    0
ejection_fraction           0
high_blood_pressure         0
platelets                   0
serum_creatinine            0
serum_sodium                0
sex                         0
smoking                     0
time                        0
DEATH_EVENT                 0
dtype: int64

Use different strategies to handle missing data:
    Remove rows with missing values using dropna().
    Fill missing values with the mean, median, or a specific value using fillna().
    Use forward or backward filling (ffill() or bfill()) to fill missing data.

In [5]:
df.dropna()  # returns a new DataFrame; df itself is unchanged unless you reassign it

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.00,1.9,130,1,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
2,65.0,0,146,0,20,0,162000.00,1.3,129,1,1,7,1
3,50.0,1,111,0,20,0,210000.00,1.9,137,1,0,7,1
4,65.0,1,160,1,20,0,327000.00,2.7,116,0,0,8,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
294,62.0,0,61,1,38,1,155000.00,1.1,143,1,1,270,0
295,55.0,0,1820,0,38,0,270000.00,1.2,139,0,0,271,0
296,45.0,0,2060,1,60,0,742000.00,0.8,138,0,0,278,0
297,45.0,0,2413,0,38,0,140000.00,1.4,140,1,1,280,0


In [10]:
print(df.columns)

Index(['age', 'anaemia', 'creatinine_phosphokinase', 'diabetes',
       'ejection_fraction', 'high_blood_pressure', 'platelets',
       'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time',
       'DEATH_EVENT'],
      dtype='object')


In [14]:
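# Choose one strategy: once the mean fill has run, no gaps remain for the median fill.
df['age'] = df['age'].fillna(df['age'].mean())    # fill with the column mean
df['age'] = df['age'].fillna(df['age'].median())  # alternative: fill with the column median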

In [None]:
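# fillna(method=...) is deprecated in recent pandas; use ffill() and bfill() directly.
df.ffill()  # forward fill: carry the last valid value forward
df.bfill()  # backward fill: pull the next valid value backward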

3. Questions:
What strategy did you use to handle missing values, and why?
It depends on the context, but common strategies include filling missing values with the mean for numerical data or using forward/backward fill for time-series data.

How did filling missing values affect the dataset?
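Filling with statistical values (mean/median) retains the dataset size, while dropping rows with missing data reduces it. In this dataset isnull().sum() reports zero missing values in every column, so neither approach changes the data.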

When might it be more appropriate to drop rows with missing values instead of filling them?
When the percentage of missing data is small, or when filling the values would introduce bias or inaccuracies.
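
Because the heart failure dataset has no missing values, the strategies above leave it unchanged. A minimal sketch on made-up data with gaps shows what each one does; the values are hypothetical and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap per column, for illustration only.
toy = pd.DataFrame({
    "age": [60.0, np.nan, 70.0, 50.0],
    "platelets": [250000.0, 300000.0, np.nan, 200000.0],
})

dropped = toy.dropna()                    # keeps only rows with no missing values
mean_filled = toy.fillna(toy.mean())      # fills each column with its own mean
median_filled = toy.fillna(toy.median())  # fills each column with its own median
forward_filled = toy.ffill()              # carries the last valid value forward
backward_filled = toy.bfill()             # pulls the next valid value backward

print(len(toy), len(dropped))  # row counts before and after dropna()
print(mean_filled)
```

Dropping rows shrinks the frame, while every fill variant keeps all four rows and differs only in the value it puts into each gap.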

###Exercise 3: Data Transformation

1. Objective: Transform data to prepare it for analysis.

2. Steps:
Normalize numerical features using Min-Max scaling or Z-score standardization with sklearn.preprocessing.

In [9]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# Scale a real column; the placeholder name 'column' raises a KeyError on this dataset.
df[['age']] = scaler.fit_transform(df[['age']])

Encode categorical variables using one-hot encoding with pd.get_dummies() or sklearn.preprocessing.OneHotEncoder.

In [None]:
# 'category_column' is a placeholder; replace it with a categorical column from your data.
df = pd.get_dummies(df, columns=['category_column'])

Use pd.cut() to bin continuous variables into discrete intervals.

In [None]:
df['binned'] = pd.cut(df['age'], bins=5)  # five equal-width bins over the range of 'age'

3. Questions:
What is the difference between normalization and standardization?
Normalization (Min-Max scaling) rescales each feature to a fixed range, usually 0 to 1, while standardization (Z-score) subtracts the mean and divides by the standard deviation, giving a mean of 0 and a standard deviation of 1.
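
The difference is easiest to see side by side. This is a small sketch on hypothetical values, using the two scikit-learn scalers that correspond to the two techniques.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical values, for illustration only.
toy = pd.DataFrame({"age": [40.0, 50.0, 60.0, 80.0]})

# Min-Max scaling: (x - min) / (max - min)
normalized = MinMaxScaler().fit_transform(toy[["age"]])

# Z-score standardization: (x - mean) / std, using the population standard deviation
standardized = StandardScaler().fit_transform(toy[["age"]])

print(normalized.ravel())
print(standardized.ravel())
```

The normalized column runs from 0 to 1, whereas the standardized column is centered on 0 and is not confined to a fixed range.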

How does one-hot encoding transform categorical variables?
It converts categorical columns into binary columns for each category, assigning a 1 where the category is present and 0 otherwise.
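
As an illustration, the sketch below encodes a made-up text column. The name `smoker_status` and its values are hypothetical, since the categorical columns in the heart failure dataset are already stored as 0/1 integers.

```python
import pandas as pd

# Hypothetical categorical column, for illustration only.
toy = pd.DataFrame({"smoker_status": ["never", "current", "former", "never"]})

# One new 0/1 column per category; dtype=int gives integers instead of booleans.
encoded = pd.get_dummies(toy, columns=["smoker_status"], dtype=int)
print(encoded)
```

Each row has exactly one 1 across the new columns, marking the category that row belonged to.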

Why might you want to bin continuous variables into categories?
Binning can simplify data, dampen the effect of noise and outliers, and make patterns more apparent; it also lets linear models capture non-linear relationships between a feature and the target.
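
A short sketch of both common ways to call `pd.cut()`: with a number of equal-width bins, and with explicit edges and labels. The ages, bin edges, and label names are assumptions chosen for illustration.

```python
import pandas as pd

ages = pd.Series([42, 55, 61, 70, 88], name="age")  # hypothetical values

# Three equal-width intervals spanning the observed range.
equal_width = pd.cut(ages, bins=3)

# Explicit edges with readable labels; intervals are right-closed, e.g. (0, 60].
labelled = pd.cut(ages, bins=[0, 60, 75, 120],
                  labels=["under_60", "60_to_75", "over_75"])

print(equal_width.value_counts().sort_index())
print(labelled.tolist())
```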

###Exercise 4: Feature Engineering 

1. Objective: Create new features to improve the predictive power of a dataset.

2. Steps:
Create new features by combining or transforming existing features (e.g., adding interaction terms or polynomial features).

In [None]:
# 'feature1' and 'feature2' are placeholders for numeric columns in your data.
df['interaction_feature'] = df['feature1'] * df['feature2']  # Interaction term
df['polynomial_feature'] = df['feature1'] ** 2  # Polynomial term

Extract date-based features (e.g., year, month, day) from datetime columns using pd.to_datetime() and dt accessor.

In [None]:
# 'date_column' is a placeholder; the heart failure dataset has no datetime column.
df['date_column'] = pd.to_datetime(df['date_column'])
df['year'] = df['date_column'].dt.year
df['month'] = df['date_column'].dt.month
df['day'] = df['date_column'].dt.day

Use domain knowledge to engineer features that might be useful for your specific problem.

In [None]:
df['is_weekend'] = df['date_column'].dt.weekday >= 5  # weekday: Monday=0 ... Sunday=6

3. Questions:
What new features did you create, and why?
Interaction terms or polynomial features to capture relationships between variables, or date features to capture trends over time.

How did the new features improve the dataset?
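New features can expose relationships that the raw columns hide, which can improve model performance; compare validation scores with and without them to confirm the gain.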

How can date-based features be useful in a dataset?
Date-based features help models understand seasonal trends, changes over time, etc.
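
Since the dataset used here has no datetime column, the following sketch extracts date-based features from a hypothetical `visit_date` column instead. The column name and the dates are made up.

```python
import pandas as pd

# Hypothetical visit dates stored as text, for illustration only.
toy = pd.DataFrame({"visit_date": ["2024-01-05", "2024-01-06", "2024-03-11"]})
toy["visit_date"] = pd.to_datetime(toy["visit_date"])

# The .dt accessor exposes the calendar components of each timestamp.
toy["year"] = toy["visit_date"].dt.year
toy["month"] = toy["visit_date"].dt.month
toy["day"] = toy["visit_date"].dt.day
toy["is_weekend"] = toy["visit_date"].dt.weekday >= 5  # Monday=0, so 5 and 6 are the weekend

print(toy)
```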

###Exercise 5: Data Cleaning

1. Objective: Clean data to ensure it's ready for analysis.

2. Steps:
Remove duplicate rows using drop_duplicates().

In [None]:
df = df.drop_duplicates()  # assign the result; drop_duplicates() returns a new DataFrame

Detect and remove outliers using the Z-score method or the IQR method.

Z-score method:

In [None]:
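import numpy as np
from scipy import stats
# Keep rows whose Z-score is within 3 standard deviations; 'column' is a placeholder name.
df = df[np.abs(stats.zscore(df['column'])) < 3]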

IQR method:

In [None]:
Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)
IQR = Q3 - Q1
df = df[~((df['column'] < (Q1 - 1.5 * IQR)) | (df['column'] > (Q3 + 1.5 * IQR)))]

Correct inconsistencies in categorical data (e.g., standardizing text formats or merging similar categories).

In [None]:
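df['category_column'] = df['category_column'].str.lower()
# Assign the result back; replace(..., inplace=True) on a selected column may not update df.
df['category_column'] = df['category_column'].replace({'cat_a': 'category_a', 'cat_b': 'category_b'})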

3. Questions:
How did you identify and handle duplicate rows in the dataset?
Use df.duplicated() to find duplicates and remove them with drop_duplicates().
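
A minimal sketch of that two-step check on made-up rows; the values are hypothetical.

```python
import pandas as pd

# Hypothetical data in which rows 0 and 1 are identical.
toy = pd.DataFrame({"age": [60, 60, 45], "sex": [1, 1, 0]})

print(toy.duplicated())                # True marks a repeat of an earlier row
n_duplicates = toy.duplicated().sum()  # how many rows would be removed

# Keep the first occurrence of each row and renumber the index.
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduplicated))
```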

What method did you use to detect and remove outliers, and why?
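The Z-score method flags values more than a chosen number of standard deviations (commonly 3) from the mean, while the IQR method flags values more than 1.5 × IQR below the first quartile or above the third. Both help identify extreme values that could distort model training.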
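
Both rules can be compared on a small made-up series with one obvious extreme value. This sketch computes the Z-score with pandas rather than `scipy.stats.zscore`; passing `ddof=0` is meant to match SciPy's default population standard deviation. The series name and values are hypothetical.

```python
import pandas as pd

# Hypothetical measurements; 95 is the intended outlier.
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95], name="measurement")

# Z-score rule: keep points within 3 standard deviations of the mean.
z = (values - values.mean()) / values.std(ddof=0)
kept_z = values[z.abs() < 3]

# IQR rule: keep points within 1.5 * IQR of the first and third quartiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
kept_iqr = values[values.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

print(len(values), len(kept_z), len(kept_iqr))
```

On very small samples the Z-score rule can miss outliers, because a single extreme value inflates the standard deviation it is measured against; the IQR rule is less sensitive to this.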

How did you address inconsistencies in categorical data?
Standardized categories by converting all text to lowercase or merging similar categories.
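
The sketch below applies that cleanup to made-up labels. Stripping surrounding whitespace with `str.strip()` is an extra step added here as an assumption about typical messy data; the mapping reuses the hypothetical names from the cell above.

```python
import pandas as pd

# Hypothetical labels with inconsistent case, spacing, and naming.
toy = pd.DataFrame({"category_column": [" Cat_A", "cat_a", "CAT_B ", "category_b"]})

toy["category_column"] = (
    toy["category_column"]
    .str.strip()   # remove leading and trailing whitespace
    .str.lower()   # unify letter case
    .replace({"cat_a": "category_a", "cat_b": "category_b"})  # merge equivalent labels
)

print(toy["category_column"].value_counts())
```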

###Exercise 6: Splitting Data into Training and Testing Sets

1. Objective: Prepare the data for model training by splitting it into training and testing sets.

2. Steps:
Use sklearn.model_selection.train_test_split() to split the dataset into training and testing sets.

In [None]:
from sklearn.model_selection import train_test_split
# Replace the 'target_column' placeholder with the outcome column, 'DEATH_EVENT' in this dataset.
X = df.drop('DEATH_EVENT', axis=1)
y = df['DEATH_EVENT']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Ensure that the target variable is correctly separated from the features.

Explore the impact of different train-test split ratios (e.g., 70-30, 80-20) on model performance.

3. Questions:
How do you split a dataset into training and testing sets in Python?
Use train_test_split() from sklearn to separate data into training and testing sets.

What considerations should you keep in mind when choosing a train-test split ratio?
The split should balance enough training data for learning and enough test data to evaluate generalization.

How does the size of the training set impact the model's ability to generalize?
A larger training set helps the model learn patterns but might leave less data to assess performance accurately.
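
To see how the ratio changes the two sets, this sketch splits made-up data at 80-20 and 70-30. The `stratify=y` argument is an optional addition not used in the cell above; it keeps the class proportions similar in both sets. The data and column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, one feature, a binary target with 30% positives.
X = pd.DataFrame({"feature": np.arange(100)})
y = pd.Series([1] * 30 + [0] * 70, name="target")

for test_size in (0.2, 0.3):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42, stratify=y)
    # Report the set sizes and the share of positives in the test set.
    print(test_size, len(X_train), len(X_test), y_test.mean())
```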

###Exercise 7: Data Preprocessing Pipeline

1. Objective: Build a preprocessing pipeline to automate the data preparation process.

2. Steps:
Use sklearn.pipeline.Pipeline to create a pipeline that includes steps such as missing value imputation, feature scaling, and encoding categorical variables.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

numeric_features = ['num_feature1', 'num_feature2']
categorical_features = ['cat_feature1']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

Fit the pipeline to the training data and transform the test data.

In [None]:
pipeline.fit(X_train)
X_train_transformed = pipeline.transform(X_train)
X_test_transformed = pipeline.transform(X_test)

Integrate the preprocessing pipeline with a machine learning model for end-to-end training and evaluation.

In [None]:
from sklearn.linear_model import LogisticRegression
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('classifier', LogisticRegression())])
model.fit(X_train, y_train)
model.score(X_test, y_test)

3. Questions:
What are the benefits of using a preprocessing pipeline?
A pipeline streamlines the preprocessing steps and ensures consistency between training and testing data.

How does the pipeline ensure consistency between training and test data transformations?
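The pipeline learns its parameters (for example, imputation means, scaling statistics, and encoder categories) from the training data during fit() and reuses those same parameters in transform(), so the test data is processed identically and never influences the fitted values.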
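
A single scaler is enough to illustrate this. In the sketch below, which uses hypothetical arrays, the scaler is fitted on the training rows only and the test row is transformed with the training mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])  # hypothetical training data
X_test = np.array([[10.0]])                # hypothetical test data

scaler = StandardScaler().fit(X_train)     # learns mean and std from the training rows only
X_test_scaled = scaler.transform(X_test)   # reuses those training statistics

print(scaler.mean_, scaler.scale_)
print(X_test_scaled)
```

Calling `fit()` on the test data as well would leak information from the test set into the preprocessing.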

How can you extend the pipeline to include additional preprocessing steps?
Add another (name, transformer) tuple to the steps list of the relevant Pipeline, or another entry to the ColumnTransformer, for tasks such as missing value handling, encoding, or feature generation.
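
One possible extension is sketched below: a `PolynomialFeatures` step is added to the numeric branch between imputation and scaling. The data is made up, the column names follow the placeholders used in the cells above, and the choice of `PolynomialFeatures` is only an example of an extra step.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

# Hypothetical data with one missing numeric value.
toy = pd.DataFrame({
    "num_feature1": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0, 7.0, 8.0],
    "cat_feature1": ["a", "b", "a", "b", "b", "a", "b", "a"],
    "target": [0, 0, 0, 0, 1, 1, 1, 1],
})

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),  # the added step: x and x^2
    ("scaler", StandardScaler()),
])

preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, ["num_feature1"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["cat_feature1"]),
])

model = Pipeline(steps=[("preprocessor", preprocessor),
                        ("classifier", LogisticRegression())])

X = toy[["num_feature1", "cat_feature1"]]
y = toy["target"]
model.fit(X, y)

# Two numeric columns (x, x^2) plus two one-hot columns (a, b).
transformed = model.named_steps["preprocessor"].transform(X)
print(transformed.shape)
```

Because the new step lives inside the pipeline, it is fitted and applied together with the other steps whenever `fit()` or `predict()` is called.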