
### **Day 2 - Exploratory Data Analysis**

### **NumPy & Pandas**

In [None]:
%pip install numpy pandas



## **Importing necessary libraries**

In [None]:

import numpy as np
import pandas as pd


In [None]:
## Creating NumPy arrays
array1 = np.array([1, 2, 3, 4, 5])
print("NumPy Array:", array1)

NumPy Array: [1 2 3 4 5]


In [None]:
## Basic operations with NumPy arrays
print("Sum:", np.sum(array1))
print("Mean:", np.mean(array1))
print("Standard Deviation:", np.std(array1))

Sum: 15
Mean: 3.0
Standard Deviation: 1.4142135623730951
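
Note that `np.std` computes the population standard deviation by default (`ddof=0`), whereas pandas' `.std()` and `describe()` use the sample version (`ddof=1`). A minimal sketch of the difference, reusing the same five values:

```python
import numpy as np
import pandas as pd

values = np.array([1, 2, 3, 4, 5])

# NumPy default: population standard deviation (divide by n)
population_std = np.std(values)

# Sample standard deviation (divide by n - 1)
sample_std = np.std(values, ddof=1)

# pandas uses ddof=1 by default, so it matches the sample version
pandas_std = pd.Series(values).std()

print("Population std (ddof=0):", population_std)
print("Sample std (ddof=1):", sample_std)
print("pandas Series.std():", pandas_std)
```

This is why a NumPy result and a `describe()` result for the same column can differ slightly.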


In [None]:
## Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'Score': [85, 90, 88, 92]}
df = pd.DataFrame(data)
print("\nDataFrame:")
print(df)



DataFrame:
      Name  Age  Score
0    Alice   25     85
1      Bob   30     90
2  Charlie   35     88
3    David   40     92


In [None]:
## Basic operations with Pandas DataFrame
print("\nMean Age:", df['Age'].mean())
print("Max Score:", df['Score'].max())
print("Summary Statistics:")
print(df.describe())


Mean Age: 32.5
Max Score: 92
Summary Statistics:
             Age      Score
count   4.000000   4.000000
mean   32.500000  88.750000
std     6.454972   2.986079
min    25.000000  85.000000
25%    28.750000  87.250000
50%    32.500000  89.000000
75%    36.250000  90.500000
max    40.000000  92.000000


In [None]:
# Data Cleaning using NumPy and Pandas
## Handling missing values
missing_data = pd.DataFrame({'A': [1, np.nan, 3, np.nan],
                             'B': [np.nan, 5, np.nan, 8],
                             'C': [9, 10, 11, 12]})
print("\nMissing Data:")
print(missing_data)


Missing Data:
     A    B   C
0  1.0  NaN   9
1  NaN  5.0  10
2  3.0  NaN  11
3  NaN  8.0  12


In [None]:
## Dropping rows with missing values
cleaned_data = missing_data.dropna()
print("\nCleaned Data:")
print(cleaned_data)


Cleaned Data:
Empty DataFrame
Columns: [A, B, C]
Index: []
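
The result is empty because every row of `missing_data` contains at least one NaN. A hedged sketch of some less aggressive options, using the `subset`, `thresh`, and `axis` parameters of `dropna` on the same toy frame (the variable names are only illustrative):

```python
import numpy as np
import pandas as pd

missing_data = pd.DataFrame({'A': [1, np.nan, 3, np.nan],
                             'B': [np.nan, 5, np.nan, 8],
                             'C': [9, 10, 11, 12]})

# Drop a row only if column 'A' is missing
rows_with_a = missing_data.dropna(subset=['A'])

# Keep rows that have at least 2 non-missing values
rows_thresh = missing_data.dropna(thresh=2)

# Drop columns (instead of rows) that contain any missing value
complete_columns = missing_data.dropna(axis=1)

print(rows_with_a)
print(rows_thresh)
print(complete_columns)
```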


In [None]:
## Filling missing values with a specific value
filled_data = missing_data.fillna(0)
print("\nFilled Data:")
print(filled_data)


Filled Data:
     A    B   C
0  1.0  0.0   9
1  0.0  5.0  10
2  3.0  0.0  11
3  0.0  8.0  12
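
Filling with 0 can distort statistics such as the mean. One common alternative is to fill each column with its own mean; the following is a minimal sketch on the same toy frame, not a recommendation for every dataset:

```python
import numpy as np
import pandas as pd

missing_data = pd.DataFrame({'A': [1, np.nan, 3, np.nan],
                             'B': [np.nan, 5, np.nan, 8],
                             'C': [9, 10, 11, 12]})

# Column means are computed while skipping NaN values
column_means = missing_data.mean()

# Passing a Series fills each column with the value stored under its name
mean_filled = missing_data.fillna(column_means)

print(mean_filled)
```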


## **Analysing the Dataset**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:


# Read the CSV file
df = pd.read_csv('/content/drive/MyDrive/telecom_churn.csv')


In [None]:
df.head(5)

In [None]:
print(df.shape)

In [None]:
print(df.columns)

In [None]:
print(df.info())

In [None]:
df["Churn"] = df["Churn"].astype("int64")

In [None]:
df.describe()

In [None]:
df.describe(include=["object", "bool"])

In [None]:
df["Churn"].value_counts()

In [None]:
df.sort_values(by="Total day charge", ascending=False).head()

In [None]:
df.sort_values(by=["Churn", "Total day charge"], ascending=[True, False]).head()

In [None]:
df["Churn"].mean()

In [None]:
df.loc[0:5, "State":"Area code"]

In [None]:
df.iloc[0:5, 0:3]

In [None]:
# Select the last row of the DataFrame
df[-1:]

## **Applying Functions to Cells, Columns and Rows**
To apply functions to each column, use apply():

In [None]:
df.apply(np.max)

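The apply method can also be used to apply a function to each row. To do this, specify axis=1.
Lambda functions are very convenient in such scenarios. They work the same way with the apply method of a single column, as in the next cell, which keeps only the rows whose State starts with "W".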

In [None]:
df[df["State"].apply(lambda state: state[0] == "W")].head()
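
The cell above applies the lambda to a single column. For a row-wise `apply` with `axis=1`, a minimal sketch on a small made-up frame could look like this (the column names mirror the dataset, but the values and the `Busiest period` column are hypothetical):

```python
import pandas as pd

# Hypothetical values; only the column names follow the telecom dataset
calls = pd.DataFrame({
    "Total day calls": [110, 80, 95],
    "Total eve calls": [99, 120, 90],
    "Total night calls": [91, 100, 130],
})

# axis=1 passes each row to the function as a Series indexed by column name,
# so idxmax() returns the name of the column with the largest value
calls["Busiest period"] = calls.apply(lambda row: row.idxmax(), axis=1)

print(calls)
```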

The map method can be used to replace values in a column by passing a dictionary of the form {old_value: new_value} as its argument:

In [None]:
d = {"No": False, "Yes": True}
df["International plan"] = df["International plan"].map(d)
df.head()

The same thing can be done with the replace method:

In [None]:
df = df.replace({"Voice mail plan": d})
df.head()
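
One difference between the two methods is worth knowing: with a dictionary, `map` turns values that are not among the keys into NaN, while `replace` leaves them unchanged. A small sketch with a hypothetical column that contains an unexpected value:

```python
import pandas as pd

d = {"No": False, "Yes": True}

# Hypothetical column with one value ("Unknown") that is not a key of d
plan = pd.Series(["No", "Yes", "Unknown"], dtype=object)

mapped = plan.map(d)        # values missing from d become NaN
replaced = plan.replace(d)  # values missing from d are kept as they are

print(mapped)
print(replaced)
```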

In [None]:
df.pivot_table(
    ["Total day calls", "Total eve calls", "Total night calls"],
    ["Area code"],
    aggfunc="mean",
)

In [None]:
total_calls = (
    df["Total day calls"]
    + df["Total eve calls"]
    + df["Total night calls"]
    + df["Total intl calls"]
)
df.insert(loc=len(df.columns), column="Total calls", value=total_calls)
# loc parameter is the number of columns after which to insert the Series object
# we set it to len(df.columns) to paste it at the very end of the dataframe
df.head()

In [None]:
df["Total charge"] = (
    df["Total day charge"]
    + df["Total eve charge"]
    + df["Total night charge"]
    + df["Total intl charge"]
)
df.head()

In [None]:
# get rid of just created columns
df.drop(["Total charge", "Total calls"], axis=1, inplace=True)
# and here’s how you can delete rows
df.drop([1, 2]).head()


In [None]:
pd.crosstab(df["Churn"], df["International plan"], margins=True)

## **Expected EDA Questions**

**1. What is the size and structure of the dataset?**


In [None]:
print("Dataset size and structure:")
print(df.shape)
print(df.info())

**2. Are there any missing values, and if so, how should they be handled?**


In [None]:
print("\nMissing values:")
print(df.isnull().sum())  # Check for missing values

**3. What are the data types of each variable, and do they need to be converted?**


In [None]:
print("\nData types of each variable:")
print(df.dtypes)

**4. What is the distribution of each variable?**


In [None]:
print("\nDistribution of each variable:")
print(df.describe())

### **Day 3 - Visualization**

In [None]:
# some imports to set up plotting
import matplotlib.pyplot as plt
# pip install seaborn
import seaborn as sns

# Graphics in retina format are more sharp and legible
%config InlineBackend.figure_format = 'retina'

In [None]:
sns.countplot(x="International plan", hue="Churn", data=df);

In [None]:
pd.crosstab(df["Churn"], df["Customer service calls"], margins=True)

In [None]:
sns.countplot(x="Customer service calls", hue="Churn", data=df);

In [None]:
df["Many_service_calls"] = (df["Customer service calls"] > 3).astype("int")

pd.crosstab(df["Many_service_calls"], df["Churn"], margins=True)

In [None]:
sns.countplot(x="Many_service_calls", hue="Churn", data=df);

In [None]:
pd.crosstab(df["Many_service_calls"] & df["International plan"], df["Churn"])

In [None]:
# Pairplot
# markers needs one entry per level of the hue variable; Churn has two levels
sns.pairplot(df, hue="Churn", markers=["o", "s"])

In [None]:
# Histogram
plt.figure()
sns.histplot(data=df, x='Churn', kde=True)

In [None]:
# drop the non-numeric "State" column before computing correlations
df.drop(["State"], axis=1, inplace=True)
# preview the frame with rows 1 and 2 removed (df itself is not modified)
df.drop([1, 2]).head()


In [None]:
# Pairwise correlations between the columns of df (Pearson by default)
correlation_matrix = df.corr()


In [None]:
print(correlation_matrix)

In [None]:
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()

In [None]:
# Select numeric columns for outlier detection
numeric_columns = ['Account length', 'Number vmail messages', 'Total day minutes', 'Total day calls',
                   'Total day charge', 'Total eve minutes', 'Total eve calls', 'Total eve charge',
                   'Total night minutes', 'Total night calls', 'Total night charge', 'Total intl minutes',
                   'Total intl calls', 'Total intl charge', 'Customer service calls']


In [None]:
# Compute Z-score for selected numeric columns
z_scores = (df[numeric_columns] - df[numeric_columns].mean()) / df[numeric_columns].std()


In [None]:
# Set threshold for outlier detection (e.g., z-score > 3 or < -3)
threshold = 3

In [None]:
# Find outliers
outliers = (z_scores > threshold) | (z_scores < -threshold)

In [None]:
# Print outliers
print("Outliers:")
print(outliers.sum())

In [None]:
# Visualize outliers
# For example, you can create box plots or scatter plots to visualize outliers
# For numerical columns, you can create box plots
df.boxplot(column=numeric_columns, figsize=(12, 8))
plt.xticks(rotation=45)
plt.title("Boxplot of Numerical Columns")
plt.show()

7. **Are there any patterns or trends in the data?**
   - Trend analysis or seasonal decomposition can be performed on time series data, if applicable.

8. **Are there any interesting relationships or insights that can be gained from visualizing the data?**
   - Visualization libraries like Matplotlib or Seaborn can be used to plot relationships between variables.

9. **Can you propose a machine learning model to predict customer churn based on the dataset?**

10. **How would you validate the performance of the churn prediction model?**

11. **Other than churn, what else can be predicted?**

## **Day 4 - Feature Selection**

# Feature Selection with SelectKBest

In [None]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split





In [None]:
# Separate features (X) and target variable (y)
X = df.drop(columns=['Churn'])  # Features
y = df['Churn']  # Target variable

print(X)

In [None]:
print(y)

In [None]:
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
# Initialize SelectKBest with ANOVA F-test
selector = SelectKBest(score_func=f_classif, k=5)  # Select top 5 features
X_selected = selector.fit_transform(X_train, y_train)

In [None]:
# Get indices of selected features
selected_indices = selector.get_support(indices=True)

In [None]:
# Get the names of selected features
selected_features = X.columns[selected_indices]

In [None]:
# Print selected features
print("Selected Features:")
print(selected_features)

# Feature Selection with Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

In [None]:
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)



In [None]:
# Initialize and train Random Forest Classifier
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)



In [None]:
# Extract feature importances
feature_importances = rf.feature_importances_



In [None]:
# Sort feature importances in descending order
sorted_indices = feature_importances.argsort()[::-1]



In [None]:
# Print feature importances
print("Feature Importances:")
for idx in sorted_indices:
    print(f"{X.columns[idx]}: {feature_importances[idx]}")



In [None]:
# Select features based on importance
sfm = SelectFromModel(rf, threshold='median')
X_selected = sfm.fit_transform(X_train, y_train)



In [None]:
# Print selected features
selected_features = X.columns[sfm.get_support()]
print("\nSelected Features:")
print(selected_features)

1. **What is feature selection, and why is it important in machine learning?**
   - Feature selection is the process of selecting a subset of relevant features from the original feature set. It is important in machine learning to improve model performance, reduce overfitting, and enhance interpretability.

2. **What are the main types of feature selection methods?**
   - The main types of feature selection methods are filter methods, wrapper methods, and embedded methods.

3. **Can you explain the difference between filter, wrapper, and embedded methods for feature selection?**
   - Filter methods evaluate the relevance of features based on statistical measures or scores. Wrapper methods use a specific machine learning algorithm to evaluate subsets of features. Embedded methods incorporate feature selection into the model training process.

4. **What is Recursive Feature Elimination (RFE)? How does it work, and what are its advantages and disadvantages?**
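   - Recursive Feature Elimination (RFE) is a wrapper method. It fits a model on all features, ranks them using the model's coefficients or feature importances, removes the least important feature(s), and repeats on the remaining features until the desired number is left. Its advantages are that it is simple to apply and takes into account how features behave together in the chosen model. Its disadvantages are the cost of refitting the model many times and the risk of overfitting the selection to the training data.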

5. **What are some common univariate feature selection techniques? How do they work?**
   - Common univariate feature selection techniques include ANOVA F-test, chi-square test, and mutual information. They evaluate the relationship between each feature and the target variable independently.

6. **How do you handle categorical variables in feature selection?**
   - Categorical variables can be handled in feature selection by encoding them into numerical representations before applying feature selection techniques.

7. **How do you assess the effectiveness of a feature selection method?**
   - The effectiveness of a feature selection method can be assessed by comparing model performance metrics (for example, cross-validated accuracy or F1-score) with and without the selection step.
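
The cells above cover a filter method (SelectKBest) and an embedded method (SelectFromModel) but not RFE. The following is a minimal sketch of RFE, together with a cross-validated comparison against using all features. It runs on synthetic data from `make_classification` rather than the telecom dataset, and the choice of logistic regression and of 5 features is an assumption made only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 20 features, only 5 of them informative
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5,
    n_redundant=2, random_state=42,
)

# RFE fits the estimator, drops the weakest feature (step=1),
# and repeats until 5 features remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=5, step=1)

baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
with_rfe = make_pipeline(StandardScaler(), rfe, LogisticRegression(max_iter=1000))

# Selection runs inside each fold, so the held-out fold never
# influences which features are kept
baseline_scores = cross_val_score(baseline, X_demo, y_demo, cv=5)
rfe_scores = cross_val_score(with_rfe, X_demo, y_demo, cv=5)

print("All features, mean CV accuracy:", baseline_scores.mean())
print("RFE (5 features), mean CV accuracy:", rfe_scores.mean())

# Inspect which columns RFE keeps when fit on the full demo set
with_rfe.fit(X_demo, y_demo)
kept = with_rfe.named_steps["rfe"].get_support(indices=True)
print("Kept feature indices:", kept)
```

Putting the selector inside the pipeline is a design choice: it keeps the comparison fair, because selecting features on the full data before cross-validating would leak information from the validation folds.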

In [None]:
import pandas as pd

# Create a DataFrame with a categorical variable
# (named color_df so that it does not overwrite the telecom df used above)
data = {'color': ['red', 'blue', 'green', 'red', 'green']}
color_df = pd.DataFrame(data)

# One-hot encode the 'color' variable
df_encoded = pd.get_dummies(color_df, columns=['color'])



In [None]:
# Print the encoded DataFrame
print(df_encoded)
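
One-hot encoding produces one column per category, and one of them is always redundant. As a hedged variation on the cell above, `drop_first=True` drops the first category (here `blue`, the first in sorted order), and `dtype=int` requests 0/1 integers instead of the default output type, which differs between pandas versions:

```python
import pandas as pd

colors = pd.DataFrame({'color': ['red', 'blue', 'green', 'red', 'green']})

# drop_first=True removes the column for the first category ('blue');
# a row of all zeros then stands for 'blue'
encoded = pd.get_dummies(colors, columns=['color'], drop_first=True, dtype=int)

print(encoded)
```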

# **Day 5 - Model Building**


1. **What is the difference between supervised and unsupervised learning?**
   
   - Supervised learning involves training a model on labeled data, where each data point is associated with a target variable or outcome. The goal is to learn a mapping from input features to the target variable.
   - Unsupervised learning, on the other hand, deals with unlabeled data, where the model aims to find patterns or structures in the data without explicit supervision. It's often used for tasks such as clustering, dimensionality reduction, and anomaly detection.

2. **How do you handle missing values in a dataset?**

   - Missing values can be handled by various techniques such as:
     - Imputation: Replacing missing values with a statistical measure such as the mean, median, or mode of the feature.
     - Deleting: Removing rows or columns with missing values if they constitute a small portion of the dataset.
     - Advanced techniques: Using algorithms like k-Nearest Neighbors (k-NN) or Expectation-Maximization (EM) to estimate missing values based on other data points.

3. **What is cross-validation, and why is it useful?**

   - Cross-validation is a technique used to assess the performance and generalization ability of a machine learning model. It involves splitting the dataset into multiple subsets (folds), training the model on several combinations of these subsets, and evaluating its performance on the remaining data.
   - Cross-validation is useful because it provides a more robust estimate of the model's performance compared to a single train-test split. It helps detect overfitting and provides a more accurate representation of how the model will perform on unseen data.

4. **What are hyperparameters, and how do you tune them?**

   - Hyperparameters are parameters that are set before the learning process begins and control aspects of the learning process itself. Examples include the learning rate in gradient descent, the number of hidden layers in a neural network, and the regularization parameter in regression models.
   - Hyperparameter tuning involves selecting the optimal values for these parameters to improve the performance of the model. Techniques for hyperparameter tuning include grid search, random search, and Bayesian optimization.

5. **How do you interpret the confusion matrix in a classification problem?**

   - A confusion matrix is a table that summarizes the performance of a classification model by comparing predicted class labels with actual class labels. It consists of four quadrants: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN).
   - From the confusion matrix, various performance metrics can be derived, such as accuracy, precision, recall (sensitivity), specificity, and F1-score. These metrics provide insights into the model's ability to correctly classify instances of each class and its overall performance.
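
Cross-validation and hyperparameter tuning from questions 3 and 4 are often combined. Below is a minimal sketch using grid search with stratified 5-fold cross-validation. It uses synthetic, imbalanced data from `make_classification`; the random forest, the small parameter grid, and the F1 scoring are assumptions chosen for illustration, not settings taken from this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in data with roughly 15% positives
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.85, 0.15], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0,
)

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Hyperparameter candidates; every combination is tried
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=cv,
    scoring="f1",
)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Best mean CV F1:", search.best_score_)
# The test split was not used during the search
print("Held-out test F1:", search.score(X_te, y_te))
```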

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Read the CSV file
df = pd.read_csv('/content/drive/MyDrive/telecom_churn.csv')
df

| | State | Account length | Area code | International plan | Voice mail plan | Number vmail messages | Total day minutes | Total day calls | Total day charge | Total eve minutes | Total eve calls | Total eve charge | Total night minutes | Total night calls | Total night charge | Total intl minutes | Total intl calls | Total intl charge | Customer service calls | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | No | Yes | 25 | 265.1 | 110 | 45.07 | 197.4 | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.70 | 1 | False |
| 1 | OH | 107 | 415 | No | Yes | 26 | 161.6 | 123 | 27.47 | 195.5 | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.70 | 1 | False |
| 2 | NJ | 137 | 415 | No | No | 0 | 243.4 | 114 | 41.38 | 121.2 | 110 | 10.30 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
| 3 | OH | 84 | 408 | Yes | No | 0 | 299.4 | 71 | 50.90 | 61.9 | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
| 4 | OK | 75 | 415 | Yes | No | 0 | 166.7 | 113 | 28.34 | 148.3 | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3328 | AZ | 192 | 415 | No | Yes | 36 | 156.2 | 77 | 26.55 | 215.5 | 126 | 18.32 | 279.1 | 83 | 12.56 | 9.9 | 6 | 2.67 | 2 | False |
| 3329 | WV | 68 | 415 | No | No | 0 | 231.1 | 57 | 39.29 | 153.4 | 55 | 13.04 | 191.3 | 123 | 8.61 | 9.6 | 4 | 2.59 | 3 | False |
| 3330 | RI | 28 | 510 | No | No | 0 | 180.8 | 109 | 30.74 | 288.8 | 58 | 24.55 | 191.9 | 91 | 8.64 | 14.1 | 6 | 3.81 | 2 | False |
| 3331 | CT | 184 | 510 | Yes | No | 0 | 213.8 | 105 | 36.35 | 159.6 | 84 | 13.57 | 139.2 | 137 | 6.26 | 5.0 | 10 | 1.35 | 2 | False |


In [None]:
df.columns

Index(['State', 'Account length', 'Area code', 'International plan',
       'Voice mail plan', 'Number vmail messages', 'Total day minutes',
       'Total day calls', 'Total day charge', 'Total eve minutes',
       'Total eve calls', 'Total eve charge', 'Total night minutes',
       'Total night calls', 'Total night charge', 'Total intl minutes',
       'Total intl calls', 'Total intl charge', 'Customer service calls',
       'Churn'],
      dtype='object')

In [None]:
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Assuming df is your DataFrame
# Drop 'phone number' column as it's unlikely to provide any meaningful information
#df.drop(columns=['phone number'], inplace=True)

# 1. Handling missing values (if any)
# Check for missing values
missing_values = df.isnull().sum()
print("Missing values:\n", missing_values)

# 2. Encoding categorical variables
# Convert 'international plan' and 'voice mail plan' to binary variables (0 or 1)
binary_cols = ['International plan', 'Voice mail plan']
for col in binary_cols:
    df[col] = df[col].map({'No': 0, 'Yes': 1})

# Encode 'state' column using LabelEncoder
label_encoder = LabelEncoder()
df['State'] = label_encoder.fit_transform(df['State'])

# 3. Handling outliers
# Define a function to remove outliers using IQR method
def remove_outliers(df, cols):
    for col in cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]
    return df

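# Apply remove_outliers function to numerical columns
# Caution: selecting by dtype also picks up integer-coded categorical columns,
# such as the 0/1 plan flags created above. If both quartiles of a 0/1 column
# are 0, the IQR is 0 and every row with the value 1 is dropped as an "outlier".
# This is consistent with rows 3 and 4 (International plan = Yes) being absent
# from the X printed further down.
numerical_cols = df.select_dtypes(include=['float64', 'int64']).columns.tolist()
df = remove_outliers(df, numerical_cols)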

# 4. Handling duplicates
# Drop duplicate rows
df.drop_duplicates(inplace=True)

Missing values:
 State                     0
Account length            0
Area code                 0
International plan        0
Voice mail plan           0
Number vmail messages     0
Total day minutes         0
Total day calls           0
Total day charge          0
Total eve minutes         0
Total eve calls           0
Total eve charge          0
Total night minutes       0
Total night calls         0
Total night charge        0
Total intl minutes        0
Total intl calls          0
Total intl charge         0
Customer service calls    0
Churn                     0
dtype: int64


In [None]:
# Split features and target variable
X = df.drop(columns=['Churn'])
y = df['Churn']

In [None]:
print(X)

      State  Account length  Area code  International plan  Voice mail plan  \
0        16             128        415                   0                1   
1        35             107        415                   0                1   
2        31             137        415                   0                0   
11       39              74        415                   0                0   
12       12             168        408                   0                0   
...     ...             ...        ...                 ...              ...   
3327     40              79        415                   0                0   
3328      3             192        415                   0                1   
3329     49              68        415                   0                0   
3330     39              28        510                   0                0   
3332     42              74        415                   0                1   

      Number vmail messages  Total day minutes  Tot

In [None]:
print(y)

0       False
1       False
2       False
11      False
12      False
        ...  
3327    False
3328    False
3329    False
3330    False
3332    False
Name: Churn, Length: 2522, dtype: bool


In [None]:
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
# Scale numerical features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)



In [None]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix

# Initialize SVC model with linear kernel for classification
svc_model = SVC(kernel='linear')

# Fit the SVC model to the training data
svc_model.fit(X_train_scaled, y_train)

# Make predictions on the test data
y_pred = svc_model.predict(X_test_scaled)

# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)


Accuracy: 0.9326732673267327
Confusion Matrix:
 [[463   0]
 [ 34   8]]
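
The accuracy above looks high, but the confusion matrix shows that only 8 of the 42 actual churners in the test set were detected. A short sketch that derives the usual metrics from the printed matrix, assuming scikit-learn's layout (rows are actual classes, columns are predicted classes, with the negative class first):

```python
import numpy as np

# Test-set confusion matrix printed above (rows: actual, columns: predicted)
conf_matrix = np.array([[463, 0],
                        [34, 8]])

# For a binary problem, ravel() yields TN, FP, FN, TP in this order
tn, fp, fn, tp = conf_matrix.ravel()

accuracy = (tp + tn) / conf_matrix.sum()
precision = tp / (tp + fp)   # share of predicted churners that really churned
recall = tp / (tp + fn)      # share of actual churners that were detected
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-score:  {f1:.3f}")
```

With a class imbalance like this one, recall and F1 for the churn class are usually more informative than accuracy alone.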


# Alternatively: comparing models with feature selection

In [None]:
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# Define models
models = {
    'Support Vector Machine': SVC(),
    'Multi-layer Perceptron (Neural Network)': MLPClassifier()
}

# Initialize lists to store model performance
performance_data = []
selected_features_data = []

# Train and evaluate models for churn prediction
for name, model in models.items():
    # Feature selection
    feature_selector = SelectKBest(score_func=mutual_info_classif)
    X_train_selected = feature_selector.fit_transform(X_train, y_train)
    X_test_selected = feature_selector.transform(X_test)
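    # Boolean mask of selected features
    selected_mask = feature_selector.get_support()
    selected_features_data.append({
        'Model': name,
        'Selected Features': X_train.columns[selected_mask].tolist()
    })
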
    # Train and evaluate
    model.fit(X_train_selected, y_train)
    y_train_pred = model.predict(X_train_selected)
    train_accuracy = accuracy_score(y_train, y_train_pred)
    train_confusion_matrix = confusion_matrix(y_train, y_train_pred)
    y_test_pred = model.predict(X_test_selected)
    test_accuracy = accuracy_score(y_test, y_test_pred)
    test_confusion_matrix = confusion_matrix(y_test, y_test_pred)

    # Store performance metrics
    performance_data.append({
        'Model': name,
        'Train Accuracy': train_accuracy,
        'Test Accuracy': test_accuracy,
        'Train Confusion Matrix': train_confusion_matrix,
        'Test Confusion Matrix': test_confusion_matrix
    })

performance_df = pd.DataFrame(performance_data)

performance_df

| | Model | Train Accuracy | Test Accuracy | Train Confusion Matrix | Test Confusion Matrix |
|---|---|---|---|---|---|
| 0 | Support Vector Machine | 0.924641 | 0.916832 | [[1865, 0], [152, 0]] | [[463, 0], [42, 0]] |
| 1 | Multi-layer Perceptron (Neural Network) | 0.936539 | 0.934653 | [[1809, 56], [72, 80]] | [[446, 17], [16, 26]] |
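
In the table above, the Support Vector Machine row predicts the negative class for every sample (the second column of both confusion matrices is zero). The loop fits the models on `X_train` rather than on the scaled `X_train_scaled`, which may contribute, since SVMs and neural networks are sensitive to feature scale. The following is a hedged sketch, on synthetic stand-in data, of chaining scaling, feature selection, and the model in a `Pipeline` so that every step is fit on the training split only. The step names and `k=10` are illustrative assumptions, and the results on the telecom data are not claimed here:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the churn data
X_demo, y_demo = make_classification(
    n_samples=600, n_features=15, n_informative=5,
    weights=[0.9, 0.1], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42,
)

pipeline = Pipeline([
    ("scale", StandardScaler()),                                    # standardize features
    ("select", SelectKBest(score_func=mutual_info_classif, k=10)),  # keep 10 features
    ("model", SVC()),                                               # classifier
])

# fit() learns the scaler, the selector, and the SVC from the training split;
# predict() reuses the fitted scaler and selector on the test split
pipeline.fit(X_tr, y_tr)
y_te_pred = pipeline.predict(X_te)

test_matrix = confusion_matrix(y_te, y_te_pred)
print("Test accuracy:", accuracy_score(y_te, y_te_pred))
print("Test confusion matrix:\n", test_matrix)
```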


1. What is the opposite of supervised learning?

   - Unsupervised learning.

2. What does SVM stand for?

   - Support Vector Machine.

3. What does MSE stand for?

   - Mean Squared Error.

4. What is another name for the target variable?

   - Dependent variable.

5. What is a measure of the model's flexibility?

   - Complexity.

6. What type of learning uses labeled data?

   - Supervised.

7. What do you call a model that has learned patterns without explicit supervision?

   - An unsupervised model.

8. What is the goal of feature engineering?

   - Improve model performance.

9. What type of learning involves predicting continuous values?

   - Regression.

10. What measures the difference between predicted and actual values?

    - Residuals.
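
Residuals and MSE from the answers above are directly related: the MSE is the mean of the squared residuals. A small worked sketch with made-up numbers:

```python
import numpy as np

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

# Residual = actual value minus predicted value
residuals = y_true - y_pred

# Mean Squared Error = average of the squared residuals
mse = np.mean(residuals ** 2)

print("Residuals:", residuals)
print("MSE:", mse)
```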