<a href="https://colab.research.google.com/github/comparativechrono/Principles-of-Data-Science/blob/main/Week_3/section_10_Python_Example__Technique_Selection_Based_on_Data_Type.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Section 10 - Technique selection based on data type

Selecting the appropriate data analysis technique based on the type of data at hand is crucial for extracting meaningful insights and achieving accurate results. This section provides a practical example using Python to demonstrate how to select and apply different analysis techniques depending on whether the data is numeric, categorical, or text-based. We will utilize Python's rich ecosystem, including libraries like scikit-learn, pandas, and nltk, to handle various data types effectively.

## Part 1 - revision of the techniques covered this week

1. Setting Up the Environment:

Before starting, ensure Python and the necessary libraries are installed. If not, install them using pip:

In [None]:
%pip install numpy pandas scikit-learn matplotlib nltk

2. Importing Required Libraries:

Import the libraries that will be used to handle different data types and perform data analysis:

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
import matplotlib.pyplot as plt

3. Preparing Different Data Types:

We'll create synthetic examples for three types of data: numeric, categorical, and text.

In [None]:
# Numeric Data: Generating a simple dataset with height (cm) and weight (kg)
np.random.seed(0)
data_numeric = pd.DataFrame({
    'Height': np.random.normal(loc=170, scale=10, size=100),
    'Weight': np.random.normal(loc=65, scale=15, size=100)
})

# Categorical Data: Creating a dataset with 'Gender' and 'Product' categories
data_categorical = pd.DataFrame({
    'Gender': ['Male', 'Female'] * 50,
    'Product': np.random.choice(['Product A', 'Product B', 'Product C'], 100)
})

# Text Data: Sample sentences for sentiment analysis
data_text = pd.Series([
    "I love this product",
    "This is the worst experience of my life",
    "I feel great about this",
    "This is not good",
    "Absolutely fantastic!"
])

4. Technique for Numeric Data: Standard Scaling and Gaussian Naive Bayes

For numeric data, we'll scale the data and apply a Gaussian Naive Bayes classifier to predict a simple outcome.

In [None]:
# Scale the numeric data
scaler = StandardScaler()
data_numeric_scaled = scaler.fit_transform(data_numeric)

# Synthetic binary target drawn at random, independent of the features,
# so expect accuracy close to 0.5 (chance level)
target = np.random.choice([0, 1], size=100)

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(data_numeric_scaled, target, test_size=0.25, random_state=42)

# Initialize and train the Gaussian Naive Bayes classifier
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Predict and evaluate the model
y_pred = gnb.predict(X_test)
print("Accuracy on numeric data:", accuracy_score(y_test, y_pred))
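
Because the target above is random, the reported accuracy says little about the technique itself. The sketch below is one possible variation, not part of the original example: it builds an illustrative target that depends on the features (a BMI threshold, which is an assumption made here purely for demonstration) and wraps the scaler and classifier in a scikit-learn pipeline so that the scaler is fitted on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features: height (cm) and weight (kg), as in the cell above
X = np.column_stack([
    rng.normal(170, 10, size=200),
    rng.normal(65, 15, size=200),
])

# Illustrative target (an assumption): 1 if BMI is above the sample median
bmi = X[:, 1] / (X[:, 0] / 100) ** 2
y = (bmi > np.median(bmi)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# The pipeline fits the scaler on the training rows only, then the classifier
pipeline = make_pipeline(StandardScaler(), GaussianNB())
pipeline.fit(X_train, y_train)

accuracy = pipeline.score(X_test, y_test)
print(f"Accuracy with a learnable target: {accuracy:.2f}")
```

With a target that is related to the features, the accuracy should sit well above the chance level seen with the random target.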

5. Technique for Categorical Data: Label Encoding

For categorical data, we'll use label encoding to convert categories into integers.

In [None]:
encoder = LabelEncoder()
data_categorical_encoded = data_categorical.apply(encoder.fit_transform)

# Display the encoded data
print(data_categorical_encoded.head())
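
Label encoding assigns integers such as 0, 1 and 2, which implies an order that nominal categories like 'Product' do not have. One-hot encoding is a common alternative for such columns. The following is a minimal sketch on a small hypothetical sample (the four rows are invented for illustration), using `pd.get_dummies`:

```python
import pandas as pd

# Hypothetical sample mirroring the 'Gender' and 'Product' columns above
df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Product': ['Product A', 'Product C', 'Product B', 'Product A'],
})

# One indicator column per product category, so no ordering is implied
one_hot = pd.get_dummies(df, columns=['Product'], dtype=int)
print(one_hot)
```

Each row has exactly one of the three product indicator columns set to 1.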

6. Technique for Text Data: Sentiment Analysis

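For text data, we'll use NLTK's VADER sentiment analyser, which assigns each sentence a compound score between -1 (most negative) and +1 (most positive).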

In [None]:
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

# Applying sentiment analysis
data_text_sentiment = data_text.apply(lambda x: sia.polarity_scores(x)['compound'])

# Display sentiment scores
print(data_text_sentiment)
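
The compound scores are continuous, so a further step is often needed to turn them into categories. The sketch below shows one way to do this. The helper name `label_sentiment` is hypothetical, the 0.05 cut-off is a commonly used convention for VADER that is treated here as an assumption, and the example scores are invented rather than real output for the sentences above.

```python
import pandas as pd

def label_sentiment(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Hypothetical compound scores, NOT actual VADER output
example_scores = pd.Series([0.64, -0.62, 0.0, -0.34, 0.03])

labels = example_scores.apply(label_sentiment)
print(labels.tolist())
```

In the notebook, the same function could be applied to `data_text_sentiment` in place of the invented scores.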

## Part 2 - EDA for technique selection

Part 1 applied a fixed technique to each data type. We will now demonstrate how to identify the data types within a dataset programmatically and select appropriate analysis techniques accordingly. We'll use pandas for data manipulation, seaborn and matplotlib for visualization, and scikit-learn for the machine learning algorithms.

1. Setting Up the Environment:

First, ensure you have Python installed with the necessary libraries. If not, install them using pip:

In [None]:
%pip install numpy pandas matplotlib scikit-learn seaborn

2. Importing Required Libraries:

Import the libraries that will be used for data handling, visualization, and analysis:

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

3. Preparing a Sample Dataset:

For illustration, we'll create a sample DataFrame with mixed data types:

In [None]:
# Create a DataFrame with mixed data types
data = pd.DataFrame({
    'CustomerID': range(1, 101),
    'Age': np.random.randint(18, 70, size=100),
    'Income': np.random.normal(50000, 12000, size=100),
    'Gender': np.random.choice(['Male', 'Female'], size=100),
    'TextFeedback': np.random.choice(['Good', 'Bad', 'Neutral'], size=100)
})

4. Analyzing Data Types and Selecting Techniques:

We will programmatically check data types and plot pairwise relationships for numeric data:

In [None]:
# Analyze data types
data_types = data.dtypes

# Print data types
print(data_types)

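# Select numerical columns for pairwise plot, excluding the identifier column,
# which is numeric but carries no information about the customer
numerical_cols = data.select_dtypes(include=[np.number]).drop(columns=['CustomerID'])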

# Use seaborn to plot pairwise relationships
sns.pairplot(numerical_cols)
plt.show()

# For categorical data, we could use label encoding or one-hot encoding
# Here we decide based on number of unique categories
if data['Gender'].nunique() <= 2:
    data['Gender_encoded'] = data['Gender'].map({'Male': 0, 'Female': 1})
else:
    data = pd.get_dummies(data, columns=['Gender'])

# Decision based on whether to apply clustering or classification
if 'Outcome' in data.columns:
    # Assume 'Outcome' is a binary classification target. Keep only numeric
    # predictors: RandomForestClassifier cannot be fitted on string columns.
    X = data.drop(columns=['Outcome']).select_dtypes(include=[np.number])
    y = data['Outcome']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = RandomForestClassifier(random_state=42)
    model.fit(X_train, y_train)
    print("Classification model trained.")
else:
    # Apply clustering if no apparent target variable is defined
    scaler = StandardScaler()
    scaled_data = scaler.fit_transform(numerical_cols)
    # n_init and random_state are set explicitly for reproducible results
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
    clusters = kmeans.fit_predict(scaled_data)
    print("Clustering performed.")
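
The checks above are hard-coded for the 'Gender' and 'Outcome' columns. A more general version could loop over every column and suggest a technique from its dtype and number of unique values. The following is a rough sketch of that idea: the function name `suggest_techniques`, the `max_categories` threshold and the wording of the suggestions are all hypothetical choices, and the rules are heuristics rather than a definitive procedure.

```python
import pandas as pd

def suggest_techniques(df, max_categories=10):
    """Return a {column: suggestion} dict based on dtype and cardinality."""
    suggestions = {}
    for name, col in df.items():
        if pd.api.types.is_integer_dtype(col) and col.is_unique:
            # Unique integers usually indicate a row identifier
            suggestions[name] = "likely identifier: exclude from modelling"
        elif pd.api.types.is_numeric_dtype(col):
            suggestions[name] = "numeric: scale, then model or cluster"
        elif col.nunique() <= 2:
            suggestions[name] = "binary category: map to 0/1"
        elif col.nunique() <= max_categories:
            suggestions[name] = "nominal category: one-hot encode"
        else:
            suggestions[name] = "high-cardinality or free text: use text methods"
    return suggestions

# Hypothetical sample with the same column names as the DataFrame above
sample = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4],
    'Age': [25, 40, 25, 31],
    'Income': [42000.0, 51500.0, 38250.0, 60100.0],
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'TextFeedback': ['Good', 'Bad', 'Neutral', 'Good'],
})

for column, suggestion in suggest_techniques(sample).items():
    print(f"{column}: {suggestion}")
```

Heuristics like these can misfire, for example on a small integer column whose values happen to be unique, so the suggestions are best treated as a starting point for manual review.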

## Stop - where has our data gone?

This example has shown how to programmatically inspect a dataset to determine its data types and choose analysis techniques that suit them. Building this logic into a script makes it more flexible and adaptable to varying datasets, and identifying the type and structure of the data before any analysis helps ensure that the chosen methods are suitable and that the resulting insights are reliable and actionable.

However, you will have noticed that the seaborn pair plot above included only the numerical variables. This is one point where R has an advantage over Python: its built-in pairs() function handles mixed data types seamlessly. To solve this problem we will combine what we have covered in this section.

In Python, a pair plot that includes both numerical and categorical data is possible, but it takes more manual work than R's pairs() function. seaborn's pairplot function plots pairwise relationships in a dataset but is designed primarily for numerical data, so the categorical variables must first be converted to a numerical format. One way to do this is label encoding, which assigns each category a unique integer.

Let's extend the previous example by converting the categorical variables to numbers and then using seaborn.pairplot to create the pair plot:

4. Converting Categorical Data to Numeric: Use label encoding to convert the categorical variables into numbers, either with LabelEncoder from sklearn.preprocessing or with pandas functionality:

In [None]:
from sklearn.preprocessing import LabelEncoder

# Initialize label encoder
label_encoder = LabelEncoder()

# Apply label encoder for each categorical column
data['Gender'] = label_encoder.fit_transform(data['Gender'])
data['TextFeedback'] = label_encoder.fit_transform(data['TextFeedback'])

# Now all data is numeric
print(data.head())

5. Generating the Pair Plot: Now that all data is numeric, you can easily generate a pairs plot using seaborn.pairplot:

In [None]:
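# Drop the identifier and the duplicate encoding of Gender (if present) before plotting
plot_data = data.drop(columns=['CustomerID', 'Gender_encoded'], errors='ignore')
sns.pairplot(plot_data)
plt.show()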

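This plot covers the encoded categorical variables as well as the numerical ones, with a scatter plot for each pair of variables and a histogram on the diagonal showing the distribution of each single variable. Note that the encoded categories appear as integer codes, so their scatter plots show points in discrete bands. Do you think this was more useful? Are there any better approaches to EDA?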
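
As one possible answer, each kind of variable pair can be summarised with a method suited to its types, rather than forcing every column into integers. The sketch below is an illustration on a small hypothetical sample (the six rows are invented), using a correlation matrix for numeric pairs, grouped statistics for numeric against categorical, and a contingency table for categorical pairs:

```python
import pandas as pd

# Hypothetical sample with the same column names as the DataFrame above
sample = pd.DataFrame({
    'Age': [25, 40, 31, 52, 46, 38],
    'Income': [42000, 51500, 38250, 60100, 55000, 47000],
    'Gender': ['Male', 'Female', 'Female', 'Male', 'Female', 'Male'],
    'TextFeedback': ['Good', 'Bad', 'Neutral', 'Good', 'Good', 'Bad'],
})

# Numeric vs numeric: correlation matrix
numeric_corr = sample[['Age', 'Income']].corr()

# Numeric vs categorical: summary statistics per group
income_by_gender = sample.groupby('Gender')['Income'].agg(['mean', 'median'])

# Categorical vs categorical: contingency table of counts
feedback_by_gender = pd.crosstab(sample['Gender'], sample['TextFeedback'])

print(numeric_corr, income_by_gender, feedback_by_gender, sep="\n\n")
```

The categorical variables keep their original labels in these summaries, which makes the tables easier to read than integer codes on a scatter plot.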

## Part 3 - Over to you...

Please analyse a dataset of your choice using the techniques we have discussed. You could use one of the datasets we have already encountered in this course, or explore one of the built-in datasets in scikit-learn. We explored how to look at these in today's pattern recognition notebook.