
#Exploratory Data Analysis (EDA)


##Overview


Exploratory Data Analysis (EDA) is the step in which data scientists and analysts examine a dataset to understand its structure, find underlying patterns, and detect potential anomalies. It is typically performed before any formal modeling or hypothesis testing, and its findings guide which hypotheses and models are worth pursuing.

In Python, EDA can be efficiently performed using a wide range of libraries and tools, making it a popular choice among data professionals. Some of the widely used Python libraries for EDA include Pandas, NumPy, Matplotlib, Seaborn, and Plotly. These libraries offer robust data manipulation, visualization, and statistical capabilities, enabling data scientists to explore and interact with their data effectively.

During the EDA process, data is examined to understand its structure, its distribution, and the relationships between variables. Data cleaning and preprocessing steps are also taken to handle missing values, outliers, and inconsistencies that might affect the analysis. Common techniques in this phase include summary statistics, histograms, scatter plots, and correlation matrices.

EDA plays a crucial role in identifying potential challenges and limitations in the dataset. It helps in uncovering data quality issues and can often reveal the need for further data collection or refinement. By visualizing and summarizing the data, analysts can also generate hypotheses about the relationships between variables, which can later be tested through more advanced statistical methods or machine learning algorithms.

Moreover, EDA facilitates the identification of patterns and trends that might lead to actionable insights and provide a deeper understanding of the data's characteristics. It also helps in selecting appropriate features for predictive models and deciding on the most suitable data transformation techniques.



In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Select a subset of features for descriptive statistics
features = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

# Standardize the features using StandardScaler
scaler = StandardScaler()
dataset_scaled = scaler.fit_transform(dataset[features])

# Calculate the descriptive statistics
mean_values = dataset_scaled.mean(axis=0)
std_values = dataset_scaled.std(axis=0)
min_values = dataset_scaled.min(axis=0)
max_values = dataset_scaled.max(axis=0)
median_values = np.median(dataset_scaled, axis=0)
quartiles = np.percentile(dataset_scaled, [25, 50, 75], axis=0)

# Print the descriptive statistics
print("Mean values:", mean_values)
print("Standard deviation values:", std_values)
print("Minimum values:", min_values)
print("Maximum values:", max_values)
print("Median values:", median_values)
print("Quartiles:", quartiles)


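In this example, we first load the Pima Indian Diabetes dataset using the Pandas library. We then select the features for which we want descriptive statistics (every column except `Outcome`). To put all features on a common scale, we standardize them using `StandardScaler` from scikit-learn: we fit the scaler on the selected features and transform them to obtain z-scores in `dataset_scaled`, which is a NumPy array. Note that standardization sets every column's mean to approximately 0 and its standard deviation to 1, so those two statistics are no longer informative after scaling; the minimum, maximum, median, and quartiles are expressed in standard deviations from each column's mean.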

After that, we use NumPy to calculate the descriptive statistics along axis 0 (column-wise). The array methods `mean`, `std`, `min`, and `max` compute the mean, standard deviation, minimum, and maximum of each feature, `np.median` calculates the median, and `np.percentile` returns the quartiles (25th, 50th, and 75th percentiles) as an array with one row per percentile. The 50th percentile is the same value as the median.

Finally, we print the calculated descriptive statistics for each feature.
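
Because standardized columns always have mean 0 and standard deviation 1, it can also help to summarize the data on its original scale. The following is a minimal sketch that uses `DataFrame.describe()` on a small hand-made DataFrame; the values are illustrative and are not taken from the real dataset. It also counts zeros in columns where zero is not a plausible measurement. In the Pima dataset such zeros are commonly treated as missing-value markers, which is an assumption worth checking before relying on it.

```python
import numpy as np
import pandas as pd

# Illustrative values only; the column names mirror the Pima dataset
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137, 116],
    "BloodPressure": [72, 66, 64, 0, 40, 74],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 0.0],
    "Age": [50, 31, 32, 21, 33, 30],
})

# Summary statistics on the original scale, one row per feature
summary = toy.describe().T
print(summary)

# Count zeros in columns where zero is implausible (assumed to mean "missing")
zero_as_missing = ["Glucose", "BloodPressure", "BMI"]
zero_counts = (toy[zero_as_missing] == 0).sum()
print(zero_counts)

# Replace those zeros with NaN so they no longer distort the statistics
cleaned = toy.assign(
    **{col: toy[col].replace(0, np.nan) for col in zero_as_missing}
)
print(cleaned.describe().T[["count", "mean", "min"]])
```

`describe()` ignores NaN values, so the `count` column of the cleaned summary shows how many usable observations remain per feature.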


##Data Visualization

Scikit-Learn (sklearn) is primarily a machine learning library in Python. Its plotting support is limited to model-specific helpers such as `ConfusionMatrixDisplay`, and it has no general-purpose plotting tools for exploring a dataset. To visualize data during EDA, you can use other libraries such as Matplotlib or Seaborn in conjunction with Scikit-Learn. Let's look at an example using the Pima Indian Diabetes dataset:


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Plot a histogram of the Glucose levels
plt.figure(figsize=(8, 6))
sns.histplot(data=dataset, x='Glucose', kde=True, hue='Outcome')
plt.title("Histogram of Glucose levels")
plt.xlabel("Glucose")
plt.ylabel("Count")
plt.show()

# Plot a correlation matrix
plt.figure(figsize=(10, 8))
correlation_matrix = dataset.corr()
sns.heatmap(correlation_matrix, annot=True, cmap="YlGnBu")
plt.title("Correlation Matrix")
plt.show()


In this example, we load the Pima Indian Diabetes dataset using the Pandas library. We then use the Matplotlib and Seaborn libraries for data visualization.

The first visualization is a histogram of the Glucose levels, where we use `sns.histplot` to create a histogram. We set `kde=True` to overlay a kernel density estimate plot and use the `hue='Outcome'` parameter to differentiate the distribution of Glucose levels for different outcomes (diabetic or non-diabetic).

The second visualization is a correlation matrix, where we calculate the correlation between each pair of variables in the dataset using the `corr` method. We use `sns.heatmap` to create a heatmap of the correlation matrix, setting `annot=True` to display the correlation values, and using the colormap "YlGnBu" for visual representation.

These examples demonstrate how Matplotlib and Seaborn can be used to visualize a dataset before it is passed to Scikit-Learn for analysis and modeling.
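
A heatmap can be hard to scan when there are many variables, so it is sometimes useful to list the same correlations numerically. The sketch below ranks each pair of columns by absolute correlation. It runs on a small hand-made DataFrame whose values are illustrative only, and the names `toy`, `pairs`, and `ranked` are introduced here purely for the example.

```python
from itertools import combinations

import pandas as pd

# Illustrative values only; the column names mirror the Pima dataset
toy = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 116],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6],
    "Age": [50, 31, 32, 21, 33, 30],
    "Outcome": [1, 0, 1, 0, 1, 0],
})

corr = toy.corr()

# Collect each unordered pair of columns once (skips the diagonal and mirrored entries)
pairs = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}

# Rank the pairs by absolute correlation, strongest first
ranked = pd.Series(pairs).sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

With the full dataset, `toy` would be replaced by the loaded `dataset`; the rest of the code stays the same.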


##Feature Selection and Dimensionality Reduction


Feature selection is an important step in machine learning to identify the most relevant features or variables that contribute significantly to the prediction or target variable. Scikit-Learn, a popular machine learning library in Python, provides several techniques for feature selection. Here are a few examples using the Pima Indian Diabetes dataset:

1. **Univariate Selection**:
Univariate selection evaluates each feature individually to determine its relationship with the target variable. Scikit-Learn provides the `SelectKBest` class along with various scoring functions to select a specific number (K) of features based on their scores. For example, let's use the chi-square test to select the top 5 features from the Pima Indian Diabetes dataset. Note that `chi2` requires non-negative feature values (it is intended for data such as booleans or frequencies); all features in this dataset are non-negative, so it runs without error:


In [None]:
import pandas as pd
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Separate the features and target variable
X = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1]

# Apply SelectKBest and chi2 scoring to select top 5 features
selector = SelectKBest(score_func=chi2, k=5)
X_new = selector.fit_transform(X, y)

# Get the selected feature indices
selected_indices = selector.get_support(indices=True)
selected_features = X.columns[selected_indices]

# Print the selected features
print("Selected Features:")
print(selected_features)


In this example, we use the `SelectKBest` class with the `chi2` scoring function to select the top 5 features from the Pima Indian Diabetes dataset. We separate the features (X) and the target variable (y) from the dataset. Then, we apply the `fit_transform()` method to select the top 5 features based on their chi-square scores. Finally, we retrieve the selected feature indices and print the corresponding feature names.
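
For continuous features, the ANOVA F-test (`f_classif`) is a common alternative scoring function that also accepts negative values. The following is a minimal sketch, run on synthetic data from `make_classification` rather than the Pima dataset, that also tabulates the per-feature scores and p-values stored on the fitted selector. The feature names and the choice of `k=3` are assumptions made for the illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data: 8 continuous features, 3 of them informative
X_arr, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0,
    random_state=0,
)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])

# f_classif computes an ANOVA F-statistic for each feature against the target
selector = SelectKBest(score_func=f_classif, k=3)
selector.fit(X, y)

# Tabulate score, p-value, and whether each feature was kept
report = pd.DataFrame(
    {
        "score": selector.scores_,
        "p_value": selector.pvalues_,
        "selected": selector.get_support(),
    },
    index=X.columns,
).sort_values("score", ascending=False)
print(report)
```

Inspecting the scores shows how large the gap is between the last selected feature and the first rejected one, which is a useful check on whether the chosen K is reasonable.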

2. **Feature Importance with Random Forest**:
Random Forest is an ensemble algorithm that can provide a measure of the importance of each feature in predicting the target variable. Scikit-Learn's `RandomForestClassifier` or `RandomForestRegressor` can be used to estimate feature importance. Here's an example using a Random Forest classifier:


In [None]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Separate the features and target variable
X = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1]

# Create a Random Forest classifier (fixed seed so the importances are reproducible)
rf = RandomForestClassifier(random_state=42)

# Fit the classifier to the data
rf.fit(X, y)

# Get feature importances
feature_importances = rf.feature_importances_

# Create a dataframe with feature names and their importances
feature_importances_df = pd.DataFrame({'Feature': X.columns, 'Importance': feature_importances})

# Sort the features by importance in descending order
sorted_features = feature_importances_df.sort_values(by='Importance', ascending=False)

# Print the sorted features
print("Sorted Features:")
print(sorted_features)


In this example, we use a Random Forest classifier to estimate feature importance. We separate the features (X) and the target variable (y) from the dataset. Then, we create a `RandomForestClassifier` and fit it to the data. We retrieve the feature importances using the `feature_importances_` attribute. Next, we create a dataframe with feature names and their importances and sort them in descending order. Finally, we print the sorted features based on their importance.
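
The `feature_importances_` attribute is impurity-based and is computed from the training data. Permutation importance on held-out data is a frequently used cross-check. The sketch below compares the two on synthetic data from `make_classification`; the split size, number of trees, and `n_repeats=10` are arbitrary choices for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 6 features, 3 of them informative
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# Impurity-based importances, derived from the training data; they sum to 1
print("Impurity-based:", rf.feature_importances_.round(3))

# Permutation importance: mean drop in held-out accuracy when one feature is shuffled
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
print("Permutation:   ", result.importances_mean.round(3))
```

If the two rankings disagree strongly, that is a hint to look more closely at the features involved before dropping any of them.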

These are just a couple of examples of feature selection techniques in Scikit-Learn. The library provides many other methods like Recursive Feature Elimination (RFE), L1-based feature selection, and more. The choice of the technique depends on the specific problem and the characteristics of the dataset.
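
As an illustration of one of these other methods, here is a minimal sketch of Recursive Feature Elimination using `RFE` with a logistic regression estimator. It uses synthetic data, and the estimator and the target of 3 features are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 8 features, 3 of them informative
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0,
    random_state=0,
)

# Scale first so the coefficient magnitudes are comparable across features
X_scaled = StandardScaler().fit_transform(X)

# Repeatedly fit the model and drop the feature with the smallest coefficient
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X_scaled, y)

print("Kept:   ", rfe.support_)   # boolean mask of retained features
print("Ranking:", rfe.ranking_)   # 1 = retained; larger = eliminated earlier
```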


##Dimensionality Reduction
Dimensionality reduction is a technique used to reduce the number of features (dimensions) in a dataset while retaining the most important information. Unlike feature selection, which keeps a subset of the original columns, it constructs new features from combinations of the original ones. Scikit-Learn provides several methods for dimensionality reduction.

Two commonly used techniques for dimensionality reduction in Scikit-Learn are Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE). Here's an example of how you can use these techniques with the Pima Indian Diabetes dataset:

1. Principal Component Analysis (PCA):


In [None]:
import pandas as pd
from sklearn.decomposition import PCA

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Separate features and target variable
X = dataset.drop("Outcome", axis=1)
y = dataset["Outcome"]

# Perform PCA with 2 components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Print the transformed data
print(X_pca)


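In this example, we first load the Pima Indian Diabetes dataset using Pandas. Then, we separate the features (X) and the target variable (y). We initialize a PCA object and set the number of components to 2. We fit the PCA model to the feature data (X) and transform it to obtain the reduced-dimensionality representation (X_pca) with 2 components. Finally, we print the transformed data. Note that PCA is sensitive to feature scale: because the features are not standardized here, columns with large numeric ranges, such as Insulin, dominate the components. Standardizing the features first (for example, with `StandardScaler`) is usually advisable.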
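
To see how much variance each component captures, and how feature scale affects the result, the fitted PCA object exposes `explained_variance_ratio_`. The sketch below is an illustration on synthetic data: three independent features drawn on very different scales, which are assumptions standing in for real columns.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: three independent features on very different scales
X = np.column_stack([
    rng.normal(80, 100, size=200),   # large scale
    rng.normal(32, 7, size=200),     # medium scale
    rng.normal(0.5, 0.3, size=200),  # small scale
])

# Without scaling, the large-scale feature dominates the first component
pca_raw = PCA(n_components=2).fit(X)
print("Unscaled:", pca_raw.explained_variance_ratio_)

# After standardization, each feature contributes on an equal footing
X_scaled = StandardScaler().fit_transform(X)
pca_scaled = PCA(n_components=2).fit(X_scaled)
print("Scaled:  ", pca_scaled.explained_variance_ratio_)
```

The entries of `explained_variance_ratio_` are the fractions of total variance captured by each component, so their sum indicates how much of the variance the reduced representation retains.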

2. t-Distributed Stochastic Neighbor Embedding (t-SNE):


In [None]:
import pandas as pd
from sklearn.manifold import TSNE

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Separate features and target variable
X = dataset.drop("Outcome", axis=1)
y = dataset["Outcome"]

# Perform t-SNE with 2 components (fixed seed, because t-SNE is stochastic)
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)

# Print the transformed data
print(X_tsne)


In this example, we again load the Pima Indian Diabetes dataset and separate the features (X) and the target variable (y). We create a t-SNE object and set the number of components to 2. We then call `fit_transform` on the feature data (X) to obtain the two-dimensional embedding (X_tsne). Unlike PCA, `TSNE` has no separate `transform` method, so a fitted embedding cannot be applied to new samples. t-SNE is also stochastic, so results vary between runs unless `random_state` is set. Finally, we print the transformed data.

Note: Both PCA and t-SNE are unsupervised techniques and do not take the target variable into account during the dimensionality reduction process.
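
The sketch below shows a t-SNE run with the settings that most often matter in practice: standardized inputs, an explicit `perplexity`, and a fixed `random_state`. It uses synthetic data (two groups of points), and the perplexity value of 10 is an arbitrary choice for this small sample; it must be smaller than the number of samples.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: two groups of 30 points in 5 dimensions
X = np.vstack([
    rng.normal(0, 1, size=(30, 5)),
    rng.normal(4, 1, size=(30, 5)),
])
X_scaled = StandardScaler().fit_transform(X)

# perplexity roughly controls how many neighbours each point attends to
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
X_embedded = tsne.fit_transform(X_scaled)
print(X_embedded.shape)

# TSNE offers fit_transform only; there is no transform for unseen samples
print(hasattr(tsne, "transform"))
```

Because distances in a t-SNE embedding are not directly interpretable, the output is generally used for visualization rather than as input features for a model.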


#Reflection points

**1. Descriptive Statistics:**
- What are the key measures used in descriptive statistics to summarize and describe data?
  - Sample answer: Key measures include measures of central tendency (mean, median, mode) and measures of variability (range, variance, standard deviation).

- How can descriptive statistics be used to gain insights into a dataset?
  - Sample answer: Descriptive statistics help in understanding the distribution, spread, and central values of data, enabling us to identify patterns, outliers, and key characteristics.

- What Python libraries can be used to perform descriptive statistics?
  - Sample answer: Python libraries like NumPy, Pandas, and SciPy provide functions and methods to compute descriptive statistics on datasets.

**2. Data Visualization:**
- Why is data visualization important in data analysis?
  - Sample answer: Data visualization helps in effectively communicating patterns, trends, and insights hidden within the data. It aids in understanding complex information and making informed decisions.

- What are some popular Python libraries for data visualization?
  - Sample answer: Matplotlib, Seaborn, and Plotly are commonly used Python libraries for creating various types of visualizations, including bar plots, scatter plots, histograms, and more.

- How can data visualization techniques be used to explore and present data effectively?
  - Sample answer: Data visualization techniques help in understanding relationships between variables, identifying outliers, detecting trends, and conveying findings in a visually appealing manner.

**3. Feature Selection:**
- Why is feature selection important in machine learning and data analysis?
  - Sample answer: Feature selection helps in improving model performance, reducing complexity, and enhancing interpretability by selecting the most relevant and informative features for a given task.

- What are some commonly used feature selection techniques?
  - Sample answer: Techniques like filter methods (e.g., correlation, information gain), wrapper methods (e.g., recursive feature elimination), and embedded methods (e.g., LASSO regression) are commonly used for feature selection.

- How can feature selection be implemented using Python?
  - Sample answer: Python libraries like Scikit-learn provide functions and classes for feature selection, including methods like SelectKBest, SelectFromModel, and Recursive Feature Elimination.
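
As a sketch of the embedded approach mentioned above, `SelectFromModel` can wrap an L1-penalized model and keep only the features whose coefficients are not shrunk to zero. The example uses synthetic data, and the regularization strength `C=0.1` is an arbitrary assumption that would normally be tuned.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 10 features, 3 of them informative
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0,
    random_state=0,
)
X_scaled = StandardScaler().fit_transform(X)

# The L1 penalty can drive some coefficients to exactly zero
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(l1_model).fit(X_scaled, y)

mask = selector.get_support()            # True for features that were kept
X_selected = selector.transform(X_scaled)
print("Kept features:", mask.nonzero()[0])
print("Reduced shape:", X_selected.shape)
```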

**4. Dimensionality Reduction:**
- What is the purpose of dimensionality reduction in data analysis?
  - Sample answer: Dimensionality reduction aims to reduce the number of features while preserving important information, thereby addressing the curse of dimensionality, improving efficiency, and aiding visualization.

- What are the main techniques for dimensionality reduction?
  - Sample answer: Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) are widely used techniques for dimensionality reduction.

- How can dimensionality reduction be implemented in Python?
  - Sample answer: Python libraries like Scikit-learn provide classes for dimensionality reduction, including PCA and t-SNE, which can be easily applied to datasets.


#A quiz on Exploratory Data Analysis (EDA)



**Data Visualization:**

1. Which type of data visualization is best suited for showing the distribution of a single continuous variable?
   <br>a) Scatter plot
   <br>b) Bar chart
   <br>c) Line plot
   <br>d) Histogram

2. Which data visualization technique is used to show the relationship between two continuous variables?
   <br>a) Bar chart
   <br>b) Scatter plot
   <br>c) Pie chart
   <br>d) Box plot

**Feature Selection and Dimensionality Reduction:**

3. What is the main purpose of feature selection in machine learning?
   <br>a) To increase the number of features in the dataset
   <br>b) To reduce the number of features in the dataset
   <br>c) To transform features into a higher-dimensional space
   <br>d) To remove outliers from the dataset

4. Principal Component Analysis (PCA) is primarily used for:
   <br>a) Feature scaling
   <br>b) Dimensionality reduction
   <br>c) Missing value imputation
   <br>d) Outlier detection

**Dimensionality Reduction in Python:**

5. Which Python library provides functions for Principal Component Analysis (PCA) and other dimensionality reduction techniques?
   <br>a) Scikit-learn
   <br>b) TensorFlow
   <br>c) PyTorch
   <br>d) Pandas

6. What does the explained variance in PCA represent?
   <br>a) The percentage of the original information retained in the data after dimensionality reduction
   <br>b) The number of features in the original dataset
   <br>c) The ratio of the training set size to the testing set size
   <br>d) The number of iterations required for PCA convergence

---
**Answers:**

1. d) Histogram
2. b) Scatter plot
3. b) To reduce the number of features in the dataset
4. b) Dimensionality reduction
5. a) Scikit-learn
6. a) The percentage of the original information retained in the data after dimensionality reduction

---