# Data and Feature Engineering with Python

This notebook will cover the major concepts presented in the lecture "Data and Feature Engineering with Python" by Dr. Dominik Jung. We will dive into different areas of data engineering, including data preprocessing, feature engineering, exploratory data analysis (EDA), and practical use of popular Python libraries such as Pandas, Scikit-learn, and Matplotlib. This notebook will also include exercises, examples, and insights to help you develop a deep understanding of the data engineering process.

### Prerequisites


Let's start by setting up our libraries.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer

Then connect Google Drive to Google Colab *(run this cell only if you are using Google Colab)*.

In [None]:
# Access your files from Google Drive in Colab
from google.colab import drive
drive.mount('/content/drive')



### 1. Data Imports
Data imports are a foundational part of data engineering. Below, we explore how to import data from different sources and ensure it is in a usable form.

- **File Formats**: CSV, JSON, SQL Databases, etc.

In [None]:
# Example of reading a CSV file
data = pd.read_csv('data.csv')
print(data.head())

- **Handling Different File Types**: We often encounter data in delimited text files, JSON, databases, etc.


In [None]:
# Read JSON
json_data = pd.read_json('data.json')

# Using SQL Databases
import sqlite3
conn = sqlite3.connect('database.db')
sql_data = pd.read_sql_query("SELECT * FROM table_name", conn)
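The file names above (`data.csv`, `data.json`, `database.db`) are placeholders. As a minimal sketch of the same three readers, the cell below writes a small made-up DataFrame to a temporary folder and reads it back. The `toy` data, the file names, and the `toy_table` table name are assumptions for illustration only.

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy data, used only to have something to write and read back
toy = pd.DataFrame({"id": [1, 2, 3], "city": ["Berlin", "Paris", "Rome"]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "toy.csv"
    json_path = Path(tmp) / "toy.json"
    db_path = Path(tmp) / "toy.db"

    # Write the same table in three formats
    toy.to_csv(csv_path, index=False)
    toy.to_json(json_path, orient="records")

    conn = sqlite3.connect(db_path)
    try:
        toy.to_sql("toy_table", conn, index=False)
        from_sql = pd.read_sql_query("SELECT * FROM toy_table", conn)
    finally:
        conn.close()  # always release the database connection

    # Read the files back while the temporary folder still exists
    from_csv = pd.read_csv(csv_path)
    from_json = pd.read_json(json_path)

print(from_csv.shape, from_json.shape, from_sql.shape)
```

Closing the connection in a `finally` block is a design choice worth copying into real notebooks, since an open SQLite handle can keep the database file locked.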

### 2. Exploratory Data Analysis
EDA helps understand the data, identify patterns, detect anomalies, and more.

**Please update the data variable based on your imports**

- **Descriptive Statistics**


In [None]:
# Generate summary statistics
data.describe()

- **Visualization**: Matplotlib is often used to visualize data.


In [None]:
# Histogram
plt.hist(data['column_name'])
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.title('Histogram of column_name')
plt.show()

- **Correlation and Scatterplots**


In [None]:
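# Correlation matrix (computed on numeric columns only;
# non-numeric columns would raise an error in recent pandas versions)
corr_matrix = data.select_dtypes(include='number').corr()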
plt.matshow(corr_matrix, cmap='coolwarm')
plt.colorbar()
plt.show()

# Scatter plot to visualize relationships
plt.scatter(data['feature1'], data['feature2'])
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Scatter Plot of Feature 1 vs Feature 2')
plt.show()
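The EDA cells above depend on your own `data`. To try the same steps without a dataset, here is a small sketch on synthetic data. The column names `feature1`, `feature2`, and `label`, and the way the values are generated, are assumptions made for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
x = rng.normal(size=200)

# Synthetic data: feature2 is roughly 2 * feature1, label is a text column
demo = pd.DataFrame({
    "feature1": x,
    "feature2": 2 * x + rng.normal(scale=0.1, size=200),
    "label": rng.choice(["a", "b"], size=200),
})

# describe() summarizes numeric columns by default
summary = demo.describe()

# Restrict to numeric columns before computing correlations
corr = demo.select_dtypes(include="number").corr()

print(summary.loc[["mean", "std"]])
print(corr.round(2))
```

Because `feature2` is built from `feature1`, the off-diagonal correlation should be close to 1, which is the kind of pattern a correlation matrix is meant to reveal.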


### 3. Data Preprocessing
Data preprocessing is crucial for AI systems, as it can significantly affect model performance. Key steps include handling missing values, encoding categorical variables, and scaling.

- **Handling Missing Data**


In [None]:
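# Fill missing values in the numeric columns with each column's mean
# (the 'mean' strategy only works on numeric data)
numeric_cols = data.select_dtypes(include='number').columns
imputer = SimpleImputer(strategy='mean')
data_imputed = imputer.fit_transform(data[numeric_cols])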
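Real datasets usually mix numeric and categorical columns, and each type needs its own strategy. The sketch below uses a made-up three-row table (the `age` and `city` columns are hypothetical) to show mean imputation for numbers and most-frequent imputation for text.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value per column
demo = pd.DataFrame({
    "age": [25.0, np.nan, 35.0],
    "city": ["Berlin", np.nan, "Berlin"],
})
num_cols = ["age"]
cat_cols = ["city"]

# Numeric columns: replace NaN with the mean of the observed values
demo[num_cols] = SimpleImputer(strategy="mean").fit_transform(demo[num_cols])

# Categorical columns: replace NaN with the most frequent category
demo[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(demo[cat_cols])

print(demo)
```

Here the missing age becomes 30.0, the mean of 25 and 35, and the missing city becomes "Berlin".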

- **Encoding Categorical Variables**


In [None]:
# One-hot encoding
encoder = OneHotEncoder()
categorical_encoded = encoder.fit_transform(data[['categorical_feature']])


- **Standardization**


In [None]:
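# Standardize the numeric columns to zero mean and unit variance
numeric_cols = data.select_dtypes(include='number').columns
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data[numeric_cols])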
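Standardization rescales each column as z = (x - mean) / std. The sketch below checks on a small made-up array that `StandardScaler` matches this formula, using the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data
values = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(values)

# Manual standardization with the population standard deviation (ddof=0)
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(), scaled.std())
```

After scaling, the column has mean 0 and standard deviation 1 up to floating-point rounding.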

### 4. Feature Engineering
Feature engineering involves creating new features, transforming data, and selecting relevant features to improve model performance.

- **Feature Creation**: Creating new features from existing ones. For example, extracting day, month, or year from a datetime feature.


In [None]:
# Example: creating year and month features from a datetime column
dates = pd.to_datetime(data['date'])
data['year'] = dates.dt.year
data['month'] = dates.dt.month
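The `.dt` accessor exposes further calendar attributes beyond year and month. As a small sketch on made-up dates (the `date` values and the `is_weekend` feature are assumptions for illustration):

```python
import pandas as pd

# Hypothetical date strings
demo = pd.DataFrame({"date": ["2024-01-15", "2024-02-29", "2023-12-31"]})

# Parse the strings into datetime values first
demo["date"] = pd.to_datetime(demo["date"])

demo["year"] = demo["date"].dt.year
demo["month"] = demo["date"].dt.month
demo["day_of_week"] = demo["date"].dt.dayofweek  # Monday = 0, Sunday = 6
demo["is_weekend"] = demo["day_of_week"] >= 5

print(demo)
```

Derived flags such as `is_weekend` are often more useful to a model than the raw timestamp, because they expose a pattern the model would otherwise have to learn on its own.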


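- **The Curse of Dimensionality**: As the number of features grows, the data becomes sparse relative to the feature space, and models need more samples to generalize. Feature selection can mitigate this by dropping uninformative features.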


In [None]:
# Example of feature selection using Variance Threshold
# (removes low-variance features; applied to numeric columns only)
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.1)
selected_features = selector.fit_transform(data.select_dtypes(include='number'))
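`fit_transform` returns a plain array, so the column names are lost. The sketch below, on a made-up table with hypothetical column names, uses `get_support()` to see which features survive the threshold.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical data: one constant, one nearly constant, one varied feature
demo = pd.DataFrame({
    "constant": [1] * 10,
    "almost_constant": [0] * 9 + [1],
    "varied": list(range(10)),
})

selector = VarianceThreshold(threshold=0.1)
selector.fit(demo)

# get_support() returns a boolean mask over the input columns
kept = demo.columns[selector.get_support()].tolist()

print(selector.variances_)
print(kept)
```

The variances are 0, 0.09, and 8.25, so only `varied` passes the 0.1 threshold.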

### 5. Dimensionality Reduction
Dimensionality reduction can help reduce overfitting, lower computational cost, and make high-dimensional data easier to visualize.

- **PCA (Principal Component Analysis)**: PCA reduces dimensions by projecting the data onto the principal components, the orthogonal directions along which the data varies the most.

In [None]:
# PCA does not accept missing values; impute them before this step if needed
pca = PCA(n_components=2)
principal_components = pca.fit_transform(scaled_data)
plt.scatter(principal_components[:, 0], principal_components[:, 1])
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA Projection')
plt.show()
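A useful check after PCA is how much variance the retained components explain. The sketch below uses synthetic data in which the third feature is the exact sum of the first two, so two components should capture essentially all of the variance. The data construction is an assumption made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(seed=0)
base = rng.normal(size=(100, 2))

# Three features, but the third is a linear combination of the first two
X = np.column_stack([base[:, 0], base[:, 1], base[:, 0] + base[:, 1]])

# Standardize first so no feature dominates because of its scale
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)

# Fraction of the total variance captured by each component
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```

On real data the ratios rarely sum to 1, and the cumulative sum is a common guide for choosing `n_components`.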


### 6. Exercises
- **Exercise 1**: Load the provided dataset and impute missing values using the median strategy.
- **Exercise 2**: Conduct exploratory data analysis on a dataset of your choice and create at least three different types of plots.
- **Exercise 3**: Apply one-hot encoding to a categorical variable and use PCA to reduce dimensionality.

### Extra Challenge:
Implement a function for target-based encoding and compare the performance of models using binary, target-based, and one-hot encoding.


In [None]:
# Example function for percentile calculation

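def percentile(data, p):
    """Return the value at the p-th percentile (0 <= p <= 100) of data."""
    sorted_data = sorted(data)
    # Clamp the index so that p = 100 returns the largest value
    # instead of raising an IndexError
    index = min(int(p / 100.0 * len(sorted_data)), len(sorted_data) - 1)
    return sorted_data[index]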

# Testing the function
sample_data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print("90th percentile:", percentile(sample_data, 90))
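Percentiles have several competing definitions, so different implementations can return different values for the same data. As a sketch, the cell below compares a nearest-rank version (the helper `percentile_nearest_rank` is a hypothetical name introduced here) with NumPy's default, which interpolates linearly between neighboring values.

```python
import math

import numpy as np


def percentile_nearest_rank(values, p):
    """Nearest-rank percentile: the value at rank ceil(p / 100 * n), 1-based."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

nearest = percentile_nearest_rank(sample, 90)
interpolated = float(np.percentile(sample, 90))  # linear interpolation by default

print("Nearest rank:", nearest)
print("NumPy (linear interpolation):", interpolated)
```

For this sample the nearest-rank method returns 9 while NumPy returns about 9.1, so it is worth stating which definition you use when reporting percentiles.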


## Resources for Further Learning
- Official Python documentation: https://docs.python.org/3/
- Hands-on Machine Learning with Scikit-learn, Keras, and TensorFlow by Aurélien Géron
