**Experiment 1: Dataset Loading**

Dataset loading is the process of importing data from a file (like CSV) into a structure (e.g., DataFrame in pandas) for analysis.

Printing information involves displaying details about the dataset, such as its first few rows, data types, shape, and summary statistics to understand its structure and content.

In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv(r"/content/drive/MyDrive/archive.csv")

# Display first few rows
print("\n===== FIRST FEW ROWS =====")
print(df.head())

# Show dataset info and basic stats
print("\n===== DATAFRAME INFO =====")
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()

print("\n===== DATA SUMMARY =====")
print(df.describe())
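
The paragraph above also mentions shape and data types, which can be inspected directly. The following is a minimal sketch on a hypothetical three-row CSV held in memory; the column values are made up for illustration and are not taken from archive.csv.

```python
import io
import pandas as pd

# Hypothetical CSV content, used only for illustration
csv_text = "Age,Gender,SAT_Score\n21,Male,1200\n23,Female,1350\n22,Female,1280\n"
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)   # (number of rows, number of columns)
print(sample_df.dtypes)  # data type inferred for each column
```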

**Experiment 2: Dataset Cleaning**

Dataset cleaning is the process of fixing or removing incorrect, incomplete, duplicate, or irrelevant data to ensure the dataset is accurate, consistent, and ready for analysis. This includes handling missing values, duplicates, and formatting errors.

In [None]:
# Drop rows with missing values and duplicates
df.dropna(axis=0, inplace=True)
df.drop_duplicates(inplace=True)

# Reset index for a clean look
df.reset_index(drop=True, inplace=True)

# Check for any remaining null values or duplicates
print("\n===== NULL VALUE CHECK =====")
print(df.isnull().any())

print("\n===== NULL VALUE COUNT =====")
print(df.isnull().sum())

print("\n===== DUPLICATES CHECK =====")
print(df.duplicated().sum())
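
Dropping rows is only one way to handle missing values. As a hedged alternative, the sketch below fills a missing number with the column median and also repairs formatting errors before removing duplicates. The toy data and the choice of the median are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing value, inconsistent formatting, and a duplicate
toy = pd.DataFrame({
    "Age": [21.0, np.nan, 25.0, 25.0],
    "Gender": [" male", "Male", "female ", "female "],
})

# Fix formatting errors: trim whitespace and unify capitalization
toy["Gender"] = toy["Gender"].str.strip().str.title()

# Fill the missing age with the column median instead of dropping the row
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Remove exact duplicate rows and rebuild the index
toy = toy.drop_duplicates().reset_index(drop=True)
print(toy)
```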


**Experiment 3 - Part 1: Label Encoding**

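Label encoding converts categorical data into numerical values by assigning each category a unique integer. scikit-learn's LabelEncoder numbers the categories in sorted order, so "Female" becomes 0 and "Male" becomes 1. It suits binary features; for ordinal features, check that the sorted order matches the intended ranking, and for other multi-class features prefer one-hot encoding so the model does not read the integer codes as an order.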

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Define binary categorical columns
binary_cols = ['Entrepreneurship']  # Assuming values are 'Yes' and 'No'

# Apply Label Encoding ('Yes' -> 1, 'No' -> 0)
label_encoder = LabelEncoder()
for col in binary_cols:
    df[col] = label_encoder.fit_transform(df[col])

# Display dataset after label encoding
print("\n===== Label Encoded Dataset =====")
print(df.head())
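
To see which integer each category receives, the fitted encoder's classes_ attribute can be inspected. This is a small sketch on made-up labels; the codes follow the sorted order of the labels, not the order in which they appear.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels, used only for illustration
encoder = LabelEncoder()
codes = encoder.fit_transform(["Male", "Female", "Female", "Male"])

# classes_ is sorted, so each label's position is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())
```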


**Experiment 3 - Part 2: OneHot Encoding**

One-hot encoding is a technique that transforms categorical data into a series of binary columns, where each category gets its own column with values 0 or 1. For example, the "Color" feature with categories "Red," "Blue," and "Green" becomes three columns: Color_Red, Color_Blue, and Color_Green, where only one column is 1 (indicating the category) and the others are 0. This method prevents the model from misinterpreting categories as ordinal data.

In [None]:
# Define multi-class categorical columns
multi_class_cols = ['Gender', 'Field_of_Study', 'Current_Job_Level']

# Apply One-Hot Encoding (drop_first=True drops one dummy column per feature, since it is implied by the others)
df_onehot = pd.get_dummies(df, columns=multi_class_cols, drop_first=True)

# Display dataset after one-hot encoding
print("\n===== One-Hot Encoded Dataset =====")
print(df_onehot.head())
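
The "Color" example from the description can be reproduced on a toy frame. The sketch below compares the full encoding with the drop_first=True variant used in the cell above; the data is hypothetical.

```python
import pandas as pd

# Hypothetical feature, mirroring the "Color" example in the text
colors = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Blue"]})

# Full one-hot encoding: one column per category
full = pd.get_dummies(colors, columns=["Color"])

# With drop_first=True the first category (alphabetically) is dropped
reduced = pd.get_dummies(colors, columns=["Color"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

In the reduced version, a row with zeros in every remaining column stands for the dropped category.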


**Experiment 4 - Part 1: Standard Scaling**

Standard scaling is a data preprocessing technique that transforms each numerical feature to have a mean of 0 and a standard deviation of 1, so that features measured on large scales do not dominate those measured on small scales.

In [None]:
from sklearn.preprocessing import StandardScaler

# Select numerical columns
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Apply Standard Scaling
scaler_standard = StandardScaler()
df_standard_scaled = df.copy()
df_standard_scaled[numerical_cols] = scaler_standard.fit_transform(df[numerical_cols])

# Display the first few rows after Standard Scaling
print("\n===== Standard Scaled Data =====")
print(df_standard_scaled.head())
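
The transformation itself is z = (x - mean) / std. As a quick check, the sketch below applies that formula by hand to a made-up column and compares it with StandardScaler, which uses the population standard deviation (ddof=0).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data, used only for illustration
values = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# z = (x - mean) / std, with the population standard deviation
manual = (values - values.mean(axis=0)) / values.std(axis=0)
scaled = StandardScaler().fit_transform(values)

print(np.allclose(manual, scaled))
print(scaled.mean(), scaled.std())
```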


**Experiment 4 - Part 2: MinMax Scaling**

Min-Max Scaling is a data preprocessing technique that transforms numerical features to a fixed range, usually 0 to 1, by subtracting the minimum value and dividing by the range (max - min). It preserves the shape of each feature's distribution but is sensitive to outliers, because a single extreme value stretches the range.

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Apply Min-Max Scaling
scaler_minmax = MinMaxScaler()
df_minmax_scaled = df.copy()
df_minmax_scaled[numerical_cols] = scaler_minmax.fit_transform(df[numerical_cols])

# Display the first few rows after Min-Max Scaling
print("\n===== Min-Max Scaled Data =====")
print(df_minmax_scaled.head())
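
The formula from the description, (x - min) / (max - min), can be verified against MinMaxScaler on a small made-up column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data, used only for illustration
values = np.array([[10.0], [20.0], [40.0]])

# (x - min) / (max - min), computed per column
value_range = values.max(axis=0) - values.min(axis=0)
manual = (values - values.min(axis=0)) / value_range
scaled = MinMaxScaler().fit_transform(values)

print(manual.ravel())
print(np.allclose(manual, scaled))
```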


**Experiment 4 - Part 3: Normalising**

Normalization, as implemented by scikit-learn's Normalizer, rescales each row (sample) so that its vector of feature values has unit norm (L2 by default). Unlike standard and min-max scaling, it works across the features of one sample rather than down a column, which suits algorithms that compare samples by direction, such as those based on cosine similarity.

In [None]:
from sklearn.preprocessing import Normalizer

# Identify numerical columns
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Apply L2 Normalization
normalizer = Normalizer(norm='l2')  # You can also use 'l1' or 'max' if needed
df_normalized = df.copy()
df_normalized[numerical_cols] = normalizer.fit_transform(df[numerical_cols])

# Display the first few rows after normalization
print("\n===== Normalized Data =====")
print(df_normalized.head())
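
scikit-learn's Normalizer operates on rows rather than columns. The sketch below, on made-up data, shows that every row has length 1 after L2 normalization and that rows pointing in the same direction become identical.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Made-up samples: rows 0 and 2 point in the same direction
rows = np.array([[3.0, 4.0], [1.0, 0.0], [6.0, 8.0]])
normalized = Normalizer(norm="l2").fit_transform(rows)

print(normalized)
print(np.linalg.norm(normalized, axis=1))  # L2 norm of each row
```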


**Experiment 5 and 6: Plotting**

Plotting is the process of visually representing data using charts, graphs, or plots to identify patterns, trends, and relationships, making data analysis easier and more intuitive.

A histogram is a type of bar chart that shows the distribution of a dataset by grouping data into bins and counting the frequency of values in each bin.

A scatter plot displays individual data points on a two-dimensional graph, showing the relationship between two variables to identify patterns, correlations, or outliers.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Visualization: Plot histograms and scatterplots
plt.figure(figsize=(10, 10))
sns.histplot(df["Age"], kde=True)
plt.title("Age Distribution")
plt.show()

plt.figure(figsize=(10, 10))
sns.histplot(df["SAT_Score"])
plt.title("SAT Score Distribution")
plt.show()

# Scatter plot between SAT Score and University GPA colored by Career Satisfaction
plt.figure(figsize=(10, 10))
sns.scatterplot(data=df, x="SAT_Score", y="University_GPA", hue="Career_Satisfaction")
plt.title("SAT Score vs University GPA (Career Satisfaction)")
plt.show()

# Scatter plot between High School GPA and University GPA colored by Career Satisfaction
plt.figure(figsize=(10, 10))
sns.scatterplot(data=df, x="High_School_GPA", y="University_GPA", hue="Career_Satisfaction")
plt.title("High School GPA vs University GPA (Career Satisfaction)")
plt.show()

# Plot histograms for all numeric columns
# (df.hist creates its own figure, so no separate plt.figure call is needed)
df.hist(bins="auto", figsize=(10, 10))
plt.suptitle("Histograms of All Features")
plt.show()


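**Z-Score**

A z-score measures how many standard deviations a value lies from the mean of its column. The interquartile range (IQR) is the distance between the 25th and 75th percentiles. Both can be used to flag outliers: values whose absolute z-score exceeds a threshold, or values lying more than 1.5 times the IQR outside the quartiles.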

In [None]:
import pandas as pd
from scipy.stats import zscore

df = pd.read_csv(r"/content/drive/MyDrive/archive.csv")

# Calculate Z-scores for each numeric column (missing values are skipped)
for col in df.columns:
    if df[col].dtype in ['int64', 'float64']:
        z_scores = zscore(df[col], nan_policy='omit')
        print(f"Z-scores for column '{col}':")
        print(z_scores)

# Function to remove outliers using IQR
def remove_outliers_iqr(df):
    for col in df.columns:
        if df[col].dtype in ['int64', 'float64']:  # Only process numerical columns
            Q1 = df[col].quantile(0.25)
            Q3 = df[col].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR

            # Filter out outliers
            df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]
    return df

# Apply the function to remove outliers
df_no_outliers_iqr = remove_outliers_iqr(df)

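# Function to remove outliers using Z-score
# A threshold of 3 is the conventional cutoff; a threshold of 1 would discard
# roughly a third of normally distributed values in every column
def remove_outliers_zscore(df, threshold=3):
    for col in df.columns:
        if df[col].dtype in ['int64', 'float64']:  # Only process numerical columns
            # nan_policy='omit' stops one missing value from turning every z-score into NaN
            z_scores = zscore(df[col], nan_policy='omit')
            # Rows with a missing value in this column fail the comparison and are dropped
            df = df[abs(z_scores) <= threshold]
    return df

# Apply the function to remove outliers
df_no_outliers_zscore = remove_outliers_zscore(df)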

# Save cleaned data
df_no_outliers_iqr.to_csv(r"/content/drive/MyDrive/hi2_iqr.csv", index=False)
df_no_outliers_zscore.to_csv(r"/content/drive/MyDrive/hi2_zscore.csv", index=False)
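
The IQR rule used above can be followed step by step on a small made-up series. This sketch assumes pandas' default linear interpolation for quantiles; the numbers are hypothetical.

```python
import pandas as pd

# Made-up scores with one obvious outlier (95)
scores = pd.Series([10, 12, 13, 14, 15, 16, 18, 95])

q1 = scores.quantile(0.25)
q3 = scores.quantile(0.75)
iqr = q3 - q1

# Values more than 1.5 * IQR outside the quartiles are treated as outliers
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
kept = scores[(scores >= lower_bound) & (scores <= upper_bound)]

print(q1, q3, iqr)
print(lower_bound, upper_bound)
print(kept.tolist())
```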

**Experiment 7 and 8 - Filter Method: Pearson's Correlation Test**

Pearson's correlation test measures the strength and direction of a linear relationship between two continuous variables, producing a value between -1 (perfect negative correlation) and +1 (perfect positive correlation), with 0 meaning no linear correlation. A value near 0 does not rule out a non-linear relationship.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

# Define numerical features and target column
target_column = 'Starting_Salary'  # Replace with actual numeric target
numerical_features = df.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Compute Pearson correlation for each feature with target
correlation_results = {}

print("\n===== Pearson Correlation Test =====")
for feature in numerical_features:
    if feature != target_column:
        corr, _ = pearsonr(df[feature], df[target_column])
        correlation_results[feature] = corr
        print(f"Feature: {feature}, Correlation: {corr}")

# Apply threshold-based filtering
threshold = 0.2  # Change this as needed (absolute correlation threshold)
filtered_features = {k: v for k, v in correlation_results.items() if abs(v) >= threshold}

# Display filtered features
print("\n===== Features Above Correlation Threshold =====")
for feature, corr in filtered_features.items():
    print(f"Feature: {feature}, Correlation: {corr}")

# Create correlation matrix
correlation_matrix = df[numerical_features].corr()

# Display correlation matrix
print("\n===== Correlation Matrix =====")
print(correlation_matrix)

# Plot heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Feature Correlation Matrix")
plt.show()
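
The coefficient is the covariance of the two variables divided by the product of their standard deviations. As a sketch on made-up data, the hand computation below is compared with scipy's pearsonr.

```python
import numpy as np
from scipy.stats import pearsonr

# Made-up, nearly linear data, used only for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# r = sum((x - mean_x) * (y - mean_y)) / sqrt(sum((x - mean_x)^2) * sum((y - mean_y)^2))
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

r_scipy, p_value = pearsonr(x, y)
print(r_manual, r_scipy)
```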


**Experiment 7 and 8 - Filter Method: ANOVA**

ANOVA (Analysis of Variance) is a statistical test that compares the means of two or more groups to determine whether they differ significantly. As a filter method, it scores each numerical feature by how much its mean differs across the classes of a categorical target; a high F-score with a low p-value suggests the feature helps separate the classes.

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_selection import f_classif

# Define target variable and numerical features (excluding the target itself,
# which would otherwise be scored as a feature)
target_column = 'Job_Offers'
numerical_features = [
    col for col in df.select_dtypes(include=['int64', 'float64']).columns
    if col != target_column
]

# f_classif treats each distinct target value as a class, so the target should be discrete
df[target_column] = pd.to_numeric(df[target_column])

# Perform ANOVA F-Test
f_scores, p_values = f_classif(df[numerical_features], df[target_column])

# Display feature scores
print("\n===== ANOVA Feature Selection Results =====")
for feature, score, p_value in zip(numerical_features, f_scores, p_values):
    print(f"Feature: {feature}, Score: {score:.2f}, p-value: {p_value:.4f}")

# Optional: Filter features with a p-value less than 0.05 (statistically significant)
significant_features = [feature for feature, p in zip(numerical_features, p_values) if p < 0.05]

print("\nStatistically Significant Features (p < 0.05):")
print(significant_features)
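
For a single feature, the score from f_classif is a one-way ANOVA F-statistic computed over the groups defined by the target classes. The sketch below checks this on made-up data with three classes, using scipy's f_oneway for comparison.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.feature_selection import f_classif

# Made-up feature whose mean clearly differs between three classes
feature = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0, 11.0, 12.0, 13.0])
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

# One-way ANOVA over the feature values of each class
groups = [feature[labels == c] for c in np.unique(labels)]
f_scipy, p_scipy = f_oneway(*groups)

# f_classif expects a 2-D feature matrix and returns one score per column
f_sklearn, p_sklearn = f_classif(feature.reshape(-1, 1), labels)

print(f_scipy, f_sklearn[0])
```

Here the between-group mean square is 75 and the within-group mean square is 1, so both calls report an F-statistic of 75.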


**Experiment 7 and 8 - Filter Method: Chi-Square**

The Chi-Square test is a statistical test that measures the association between categorical variables by comparing the observed and expected frequencies in a contingency table to determine if they are independent.

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Define categorical features and target variable
categorical_features = ['Gender', 'Field_of_Study', 'Current_Job_Level', 'Entrepreneurship']
target_column = 'Job_Offers'  # Ensure this is a categorical variable

# Convert categorical features into numerical using one-hot encoding
df_encoded = pd.get_dummies(df[categorical_features])

# Perform Chi-Square test using SelectKBest
k = 5  # Number of top features to select
selector = SelectKBest(score_func=chi2, k='all')  # Initially select all
X_new = selector.fit_transform(df_encoded, df[target_column])

# Get feature scores
feature_scores = selector.scores_
threshold = np.percentile(feature_scores, 50)  # Select features above median score

# Get top-K features
top_k_selector = SelectKBest(score_func=chi2, k=k)
X_k_new = top_k_selector.fit_transform(df_encoded, df[target_column])
selected_k_features = np.array(df_encoded.columns)[top_k_selector.get_support()]
top_k_scores = top_k_selector.scores_[top_k_selector.get_support()]

# Display results for threshold-based selection
print("\n===== Chi-Square Features (Above Threshold) =====")
for feature, score in zip(df_encoded.columns, feature_scores):
    if score >= threshold:
        print(f"Feature: {feature}, Score: {score}")

# Display results for top-K selection
print("\n===== Top-K Chi-Square Features =====")
for feature, score in zip(selected_k_features, top_k_scores):
    print(f"Feature: {feature}, Score: {score}")
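
The comparison of observed and expected frequencies described above can be seen directly with a contingency table. This sketch uses scipy's chi2_contingency on made-up data rather than the scikit-learn scorer from the cell above, so the statistic is not computed in the same way; the column names are hypothetical.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up categorical data, used only for illustration
toy = pd.DataFrame({
    "Gender": ["Male"] * 4 + ["Female"] * 4,
    "Hired": ["Yes", "Yes", "Yes", "No", "No", "No", "No", "Yes"],
})

# Observed frequencies for every combination of the two variables
table = pd.crosstab(toy["Gender"], toy["Hired"])

# Expected frequencies are those implied by independence of the two variables
chi2_stat, p_value, dof, expected = chi2_contingency(table)

print(table)
print(expected)
print(chi2_stat, p_value, dof)
```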


**Experiment 7 and 8 - Filter Method: Information Gain**

Information Gain measures how much a feature reduces uncertainty (entropy) about the target variable. It helps determine which features contribute the most to predicting the outcome by comparing the entropy before and after splitting the data based on that feature — the higher the gain, the more important the feature.

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_selection import mutual_info_classif, SelectKBest

# Define target variable
target_column = 'Career_Satisfaction'  # Replace with actual categorical target

# Encode categorical target
df[target_column] = df[target_column].astype('category').cat.codes

# Select numerical features
numerical_features = df.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Compute mutual information (information gain)
mi_scores = mutual_info_classif(df[numerical_features], df[target_column])

# Apply threshold-based filtering
threshold = np.percentile(mi_scores, 50)  # Select features above the 50th percentile

# Get top-K features
k = 5  # Number of top features to select
top_k_selector = SelectKBest(score_func=mutual_info_classif, k=k)
X_k_new = top_k_selector.fit_transform(df[numerical_features], df[target_column])
selected_k_features = np.array(numerical_features)[top_k_selector.get_support()]
top_k_scores = top_k_selector.scores_[top_k_selector.get_support()]

# Display results for threshold-based selection
print("\n===== Information Gain (Above Threshold) =====")
for feature, score in zip(numerical_features, mi_scores):
    if score >= threshold:
        print(f"Feature: {feature}, Information Gain: {score}")

# Display results for top-K selection
print("\n===== Top-K Information Gain Features =====")
for feature, score in zip(selected_k_features, top_k_scores):
    print(f"Feature: {feature}, Information Gain: {score}")
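
For a categorical feature, the entropy-based definition above can be computed by hand. The sketch below is a minimal illustration on made-up data; the helper names entropy and information_gain are hypothetical, and mutual_info_classif estimates the quantity differently for continuous features (with a nearest-neighbor method), so its scores will not match this calculation exactly.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-(probs * np.log2(probs)).sum())

def information_gain(feature, target):
    """Entropy of the target minus its weighted entropy after splitting on the feature."""
    weighted = 0.0
    for value in np.unique(feature):
        mask = feature == value
        weighted += mask.mean() * entropy(target[mask])
    return entropy(target) - weighted

# Made-up data: one feature separates the classes perfectly, the other not at all
target = np.array([1, 1, 0, 0])
informative = np.array(["a", "a", "b", "b"])
uninformative = np.array(["a", "b", "a", "b"])

print(information_gain(informative, target))
print(information_gain(uninformative, target))
```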


**Experiment 9 and 10 - Wrapper Method: Forward Selection**

Forward Selection is a step-by-step feature selection technique that starts with no features and adds one feature at a time — the one that improves the model performance the most — until adding more features no longer improves the model significantly.

In [None]:
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import LabelEncoder

# Load data
df = pd.read_csv('/content/drive/MyDrive/archive.csv')

# Define features and target
X = df.drop('Job_Offers', axis=1)
y = df['Job_Offers']

# Encode categorical columns
for col in X.select_dtypes(include='object').columns:
    X[col] = LabelEncoder().fit_transform(X[col])

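# Forward Selection: at each step add the feature that most improves adjusted
# R-squared, and stop once no remaining feature improves it.
# (Plain R-squared never decreases when a feature is added, so it would select every feature.)
selected_features = []
remaining_features = list(X.columns)
current_score = float('-inf')

while remaining_features:
    best_score = current_score
    best_feature = None

    for feature in remaining_features:
        trial_features = selected_features + [feature]
        X_trial = sm.add_constant(X[trial_features])
        model = sm.OLS(y, X_trial).fit()

        if model.rsquared_adj > best_score:
            best_score = model.rsquared_adj
            best_feature = feature

    if best_feature is None:
        break  # No remaining feature improves adjusted R-squared

    selected_features.append(best_feature)
    remaining_features.remove(best_feature)
    current_score = best_score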

print("Selected Features (Forward Selection):", selected_features)
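
scikit-learn also ships a ready-made forward selector. The sketch below is an alternative to the manual loop, not a reproduction of it: it scores candidates by cross-validated R-squared and stops at a fixed number of features. The synthetic data and the choice of two features are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data, used only for illustration: 6 features, 2 of them informative
X_demo, y_demo = make_regression(
    n_samples=200, n_features=6, n_informative=2, noise=0.1, random_state=0
)

# Forward selection scored by 5-fold cross-validated R-squared
sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,
    direction="forward",
    scoring="r2",
    cv=5,
)
sfs.fit(X_demo, y_demo)

print(sfs.get_support())  # boolean mask marking the selected columns
```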


**Experiment 9 and 10 - Wrapper Method: Backward Elimination**

Backward Elimination is a feature selection technique that starts with all features and removes the least significant one at each step — based on a statistical measure (like p-values in regression) — until only the most important features remain.

In [None]:
import statsmodels.api as sm

# Start with all features
features = list(X.columns)
X_with_const = sm.add_constant(X[features])

while len(features) > 0:
    model = sm.OLS(y, X_with_const).fit()
    p_values = model.pvalues[1:]

    worst_p = p_values.idxmax()
    if p_values[worst_p] > 0.05:
        features.remove(worst_p)
        X_with_const = sm.add_constant(X[features])
    else:
        break

print("Selected Features (Backward Elimination):", features)


**Experiment 9 and 10 - Wrapper Method: Recursive Feature Elimination (RFE)**

Recursive Feature Elimination (RFE) is a feature selection technique that repeatedly fits a model (like a classifier or regressor), ranks the features by importance, and removes the least important ones until the desired number of features is reached.

In [None]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(random_state=42)  # Fixed seed so the selection is reproducible
rfe = RFE(model, n_features_to_select=5)
fit = rfe.fit(X, y)

selected_features = X.columns[fit.support_].tolist()
print("Selected Features (RFE):", selected_features)
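
Besides the final selection, a fitted RFE object records the order in which features were eliminated. The sketch below shows this on synthetic data with a linear model; the data and estimator are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data, used only for illustration: 5 features, 2 of them informative
X_demo, y_demo = make_regression(
    n_samples=100, n_features=5, n_informative=2, random_state=0
)

# Remove one feature per round until two remain
rfe_demo = RFE(LinearRegression(), n_features_to_select=2).fit(X_demo, y_demo)

print(rfe_demo.support_)  # True for the features that were kept
print(rfe_demo.ranking_)  # 1 for kept features; larger numbers were removed earlier
```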


**Experiment 9 and 10 - Wrapper Method: Cross Validation (CV)**

Cross-validation (CV) is a technique to evaluate a model’s performance by splitting the data into multiple parts — training on some parts and testing on the remaining — to ensure the model generalizes well to unseen data.

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(random_state=42)  # Fixed seed so the scores are reproducible
cv_scores = cross_val_score(model, X, y, cv=5, scoring='r2')

print("Cross-Validation Scores:", cv_scores)
print("Average CV Score:", cv_scores.mean())
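
What cross_val_score does can be written out as an explicit loop over folds. The sketch below uses synthetic data and a linear model (both assumptions for illustration) and checks that the manual loop gives the same scores as the library call when both use the same splits.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data, used only for illustration
X_demo, y_demo = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# Train on four folds, score on the held-out fold, and repeat for every fold
manual_scores = []
for train_idx, test_idx in kfold.split(X_demo):
    fold_model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    predictions = fold_model.predict(X_demo[test_idx])
    manual_scores.append(r2_score(y_demo[test_idx], predictions))

library_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=kfold, scoring="r2")

print(np.allclose(manual_scores, library_scores))
```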
