In [None]:
# Remove any earlier checkout first so this cell can be re-run safely
!rm -rf python_missingval
!git clone https://github.com/hatieku-boateng/python_missingval.git

In [None]:
%cd python_missingval

In [None]:
import os
print(os.listdir('.'))

In [None]:
import pandas as pd

# Load the sensor_log.csv file into a DataFrame named df
df = pd.read_csv('sensor_log.csv')

missing_temperature = df[df['temperature_c'].isnull()]
missing_temperature

In [None]:
missing_humidity = df[df['humidity_pct'].isnull()]
missing_humidity

In [None]:
missing_percentage = (df.isnull().sum() / len(df)) * 100
missing_percentage

In [None]:
missing_percentage.idxmax()

### Explanation
I filtered rows with the isnull() method to find where temperature and
humidity values were missing. Then I calculated the percentage of missing values
in each column. The results show that temperature_c has the highest percentage
of missing values in the dataset, so it needs the most attention during
data cleaning and imputation.
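
As a quick check of the calculation above, here is a minimal sketch on a tiny hand-built table. The values are made up for illustration and are not taken from sensor_log.csv. It relies on the fact that the mean of a True/False mask is the fraction of True entries, which is equivalent to dividing the sum by the number of rows.

```python
import numpy as np
import pandas as pd

# Toy data, NOT from sensor_log.csv: 4 rows, 2 missing temperatures, 1 missing humidity
toy = pd.DataFrame({
    "temperature_c": [24.0, np.nan, 25.0, np.nan],
    "humidity_pct": [40.0, 41.0, np.nan, 42.0],
})

# isnull() gives a True/False mask; its column mean is the fraction missing
toy_missing_pct = toy.isnull().mean() * 100
print(toy_missing_pct)

# idxmax() returns the label of the column with the largest percentage
worst_column = toy_missing_pct.idxmax()
print(worst_column)
```

On this toy table, temperature_c is missing in half of the rows, so it is the column idxmax() reports.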

EXERCISE 2

Calculate the mean and median

In [None]:
temp_mean = df['temperature_c'].mean()
temp_median = df['temperature_c'].median()

temp_mean, temp_median

Create a mean-imputed version

In [None]:
df['temperature_mean'] = df['temperature_c'].fillna(temp_mean)

Create a median-imputed version

In [None]:
df['temperature_median'] = df['temperature_c'].fillna(temp_median)

Compare the results

In [None]:
df[['temperature_c', 'temperature_mean', 'temperature_median']].head(10)

### Explanation for Exercise 2
I created two new columns to compare mean-based and median-based imputation
for the temperature_c column. Wherever temperature_c was missing,
temperature_mean holds 25.075, the mean of the available temperatures,
and temperature_median holds 25.0, the median of the available
temperatures.

Both methods produced similar results because the temperature values in the
dataset are close together and there are no extreme outliers. However,
median imputation is generally safer when outliers exist, while mean
imputation suits roughly symmetric readings without outliers.
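
To see why the median is safer with outliers, here is a minimal sketch on made-up readings (not from sensor_log.csv) where one value is a faulty spike. The variable names are hypothetical and only used for this illustration.

```python
import numpy as np
import pandas as pd

# Toy readings, NOT from sensor_log.csv: one faulty spike (90.0) and one gap
readings = pd.Series([24.0, 25.0, 26.0, 90.0, np.nan])

# NaN is skipped by default in both statistics
spike_mean = readings.mean()      # (24 + 25 + 26 + 90) / 4 = 41.25
spike_median = readings.median()  # midpoint of 25 and 26 = 25.5

filled_with_mean = readings.fillna(spike_mean)
filled_with_median = readings.fillna(spike_median)

# Compare the value each method puts into the gap
print(filled_with_mean.iloc[-1], filled_with_median.iloc[-1])
```

The single spike pulls the mean far above the normal readings, so the mean-filled gap looks like a second anomaly, while the median-filled gap stays close to the typical values.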

EXERCISE 3

Forward Fill (ffill)

In [None]:
# fillna(method='ffill') is deprecated in recent pandas; ffill() does the same thing
df['temperature_ffill'] = df['temperature_c'].ffill()
df[['temperature_c', 'temperature_ffill']].head(10)

Backward Fill (bfill)

In [None]:
# fillna(method='bfill') is deprecated in recent pandas; bfill() does the same thing
df['temperature_bfill'] = df['temperature_c'].bfill()
df[['temperature_c', 'temperature_bfill']].head(10)

Interpolation

In [None]:
df['temperature_interp'] = df['temperature_c'].interpolate()
df[['temperature_c', 'temperature_interp']].head(10)

Comparing All Methods Together

In [None]:
df[['temperature_c', 'temperature_ffill', 'temperature_bfill', 'temperature_interp']].head(15)

### Exercise 3 – Explanation

I applied forward fill (ffill), backward fill (bfill), and interpolation to
handle missing values in the temperature_c column.

• Forward fill copies the last known value.  
• Backward fill copies the next available value.  
• Interpolation estimates a new value between surrounding values.

From the results, forward fill and backward fill produce repeated values, while
interpolation creates smoother values that follow the trend of the data. For
sensor data that changes gradually and has only short gaps, interpolation is
usually the most suitable of the three.
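
The difference between the three methods is easiest to see on a very small example. This is a minimal sketch on a made-up series (not from sensor_log.csv) with a two-step gap, assuming the default linear method of interpolate().

```python
import numpy as np
import pandas as pd

# Toy series, NOT from sensor_log.csv: a two-step gap between 20.0 and 26.0
s = pd.Series([20.0, np.nan, np.nan, 26.0])

compare = pd.DataFrame({
    "original": s,
    "ffill": s.ffill(),         # repeats 20.0 through the gap
    "bfill": s.bfill(),         # repeats 26.0 through the gap
    "interp": s.interpolate(),  # default linear method: evenly spaced steps
})
print(compare)
```

Note that forward fill cannot fill a gap at the very start of a series and backward fill cannot fill one at the very end, because there is no earlier or later value to copy.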

MINI PROJECT

Designing My Missing-Data Strategy

In [None]:
df.isnull().sum()

Calculating Percentages

In [None]:
missing_percentage = (df.isnull().sum() / len(df)) * 100
missing_percentage

### My Missing-Data Strategy

1. For temperature_c:
   - I used **interpolation**, because temperature is continuous and changes smoothly.
   - Interpolation gives realistic values based on the trend.

2. For humidity_pct:
   - I used **median imputation**, because the values vary only slightly and
     the median is robust to outliers.

3. For any categorical columns (e.g., status or device_state):
   - I would use **dropna**, because interpolation is not defined for
     categories and forward fill could carry a stale state forward.

4. If a row is missing too many values (e.g., more than 2 essential fields):
   - I would drop the row completely.
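
Rule 4 can be sketched as follows. This is only an illustration on made-up data (not from sensor_log.csv); the list of essential columns and the limit of 2 are assumptions chosen to match the example in the rule.

```python
import numpy as np
import pandas as pd

# Toy data, NOT from sensor_log.csv; the "essential" column list is an assumption
essential = ["temperature_c", "humidity_pct", "voltage_v"]
toy = pd.DataFrame({
    "temperature_c": [24.0, np.nan, np.nan],
    "humidity_pct": [40.0, 41.0, np.nan],
    "voltage_v": [3.3, 3.2, np.nan],
})

# Count missing essential fields per row, then keep rows at or under the limit
max_missing = 2
missing_per_row = toy[essential].isnull().sum(axis=1)
kept = toy[missing_per_row <= max_missing]
print(kept)
```

Only the last toy row, which is missing all three essential fields, is dropped. The rows that are missing at most two fields stay and can still be imputed.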

Application of Strategy

Comparing Summary Statistics Before & After

Before Cleaning

In [None]:
# Take the snapshot before any column is overwritten, otherwise it would match the "after" table
summary_before = df.describe()
summary_before

Interpolate temperature_c

In [None]:
df['temperature_c'] = df['temperature_c'].interpolate()

Median-impute humidity_pct

In [None]:
df['humidity_pct'] = df['humidity_pct'].fillna(df['humidity_pct'].median())

Mean-impute voltage_v

In [None]:
df['voltage_v'] = df['voltage_v'].fillna(df['voltage_v'].mean())

After Cleaning

In [None]:
summary_after = df.describe()
summary_after

### Mini Project – Missing Data Strategy and Results

I analyzed the dataset to understand the pattern of missing values.
temperature_c had the highest percentage of missing data, while
humidity_pct and voltage_v had fewer missing entries.

Based on this, I designed the following cleaning strategy:

1. temperature_c → Interpolation  
   - Temperature is continuous and changes gradually.
   - Interpolation creates smooth, realistic values that follow the
     existing trend instead of repeating previous or next values.

2. humidity_pct → Median imputation  
   - Humidity values are fairly stable with small variability.
   - The median is robust to small fluctuations and prevents distortion
     from potential outliers.

3. voltage_v → Mean imputation  
   - Only one value was missing, and voltage readings are stable.
   - Filling a single gap with the mean leaves the column's mean unchanged.

After cleaning, I compared the summary statistics before and after imputation.
The results show that the mean, median, and standard deviation remained
almost unchanged. This means the cleaning strategy preserved the structure and
distribution of the dataset while removing missing values.
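
One way to make a before-and-after comparison concrete is to subtract the two describe() tables. This is a minimal sketch on made-up data (not from sensor_log.csv), so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Toy data, NOT from sensor_log.csv
raw = pd.DataFrame({"temperature_c": [20.0, np.nan, 24.0, 26.0]})

before = raw.describe()      # NaN is ignored, so count is 3
cleaned = raw.interpolate()  # default linear fill: the gap becomes 22.0
after = cleaned.describe()   # count is now 4

# Positive numbers mean the statistic went up after cleaning
change = after - before
print(change.loc[["count", "mean", "std"]])
```

The count row should rise by the number of filled values, while small changes in mean and std suggest the imputation did not distort the column much.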

Overall, the dataset is now complete and ready for modeling.