[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/alihaider-debug/Cricketdataanalysis/blob/main/Task8,9,10.ipynb)

**🧪 Step 1: Load Dataset and Preprocess**

In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, KFold, LeaveOneOut
from sklearn.metrics import accuracy_score, mean_squared_error, f1_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier

github_url = "https://github.com/alihaider-debug/Cricketdataanalysis/raw/main/ODI_Match_Data.csv"
df = pd.read_csv(github_url)

# Handle categorical attributes (Label Encoding)
label_encoders = {}
for col in df.select_dtypes(include='object').columns:
    # Check if the column contains mixed types
    if df[col].apply(type).nunique() > 1:
        # Convert all values to strings if mixed types are present
        df[col] = df[col].astype(str)
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

# Handle missing values (fill with median)
df.fillna(df.median(numeric_only=True), inplace=True)

# Separate features and target
X = df.drop(columns=['runs_off_bat'])
y = df['runs_off_bat']

# Scale numerical features
scaler = StandardScaler()
X = scaler.fit_transform(X)
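
Because each fitted `LabelEncoder` is stored in `label_encoders`, the integer codes can later be mapped back to the original labels. A minimal sketch on a toy frame follows; the `venue` column and its values are made up for illustration and are not taken from the ODI dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data; 'venue' is a hypothetical column used only for this example.
toy = pd.DataFrame({"venue": ["Lahore", "Sydney", "Lahore", "Leeds"]})

toy_encoders = {}
for col in ["venue"]:
    le = LabelEncoder()
    toy[col] = le.fit_transform(toy[col])
    toy_encoders[col] = le  # keep the fitted encoder for decoding later

# Codes follow the sorted order of the labels: Lahore=0, Leeds=1, Sydney=2.
codes = toy["venue"].tolist()

# inverse_transform recovers the original strings from the codes.
decoded = toy_encoders["venue"].inverse_transform(toy["venue"]).tolist()
print(codes, decoded)
```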


**📊 Step 2: Holdout Validation (80-20 Split)**

In [3]:
# Holdout Validation (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Random Forest Classifier
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
f1 = f1_score(y_test, y_pred, average='weighted')

print("🎯 Holdout Validation Results:")
print(f"Accuracy: {accuracy:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"F1-score: {f1:.4f}")


🎯 Holdout Validation Results:
Accuracy: 0.4821
RMSE: 1.7176
F1-score: 0.4739
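
Step 1 fits the `StandardScaler` on the full dataset before the split, so the test rows influence the scaling statistics. One way to avoid that is to wrap the scaler and the model in a scikit-learn pipeline, so the scaler is fit on the training rows only. The sketch below uses synthetic data from `make_classification` rather than the ODI dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and target (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The scaler lives inside the pipeline, so fit() only ever sees the training rows.
pipe = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=42),
)
pipe.fit(X_tr, y_tr)

demo_accuracy = accuracy_score(y_te, pipe.predict(X_te))
print(f"Holdout accuracy (synthetic data): {demo_accuracy:.4f}")
```

The same `pipe` object can be passed to `cross_val_score`, in which case the scaler is refit inside every fold.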


**📈 Step 3: K-Fold Cross-Validation (k=5)**

In [None]:
# K-Fold Cross-Validation (a single pass computes all three metrics)
from sklearn.model_selection import cross_validate

kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_results = cross_validate(
    model, X, y, cv=kf,
    scoring=['accuracy', 'neg_mean_squared_error', 'f1_weighted'],
)
cv_accuracy = cv_results['test_accuracy']
cv_rmse = np.sqrt(-cv_results['test_neg_mean_squared_error'])
cv_f1 = cv_results['test_f1_weighted']

print("\n📊 K-Fold Cross-Validation Results (5 Folds):")
print(f"Accuracy: {cv_accuracy.mean():.4f}")
print(f"RMSE: {cv_rmse.mean():.4f}")
print(f"F1-score: {cv_f1.mean():.4f}")
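
The mean alone hides how much the score moves between folds. A small sketch that reports the fold-to-fold standard deviation alongside the mean is shown below. It runs on synthetic data, not the ODI dataset, and uses a smaller forest to stay quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = cross_val_score(
    RandomForestClassifier(n_estimators=30, random_state=42),
    X_demo, y_demo, cv=kf_demo, scoring="accuracy",
)

# One score per fold; the standard deviation shows how stable the estimate is.
print(f"Accuracy: {fold_scores.mean():.4f} +/- {fold_scores.std():.4f}")
```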


**Step 4: Leave-One-Out Cross-Validation (LOOCV)**

In [None]:
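# Leave-One-Out Cross-Validation (LOOCV)
# Each fold holds out a single row, so per-fold RMSE and F1 are not meaningful
# (the mean of per-row square roots is the mean absolute error, not the RMSE).
# Pool the out-of-fold predictions and score them once instead.
# Note: LOOCV fits the model once per row, which is very slow on large datasets.
from sklearn.model_selection import cross_val_predict

loo = LeaveOneOut()
loo_pred = cross_val_predict(model, X, y, cv=loo)
loo_accuracy = accuracy_score(y, loo_pred)
loo_rmse = np.sqrt(mean_squared_error(y, loo_pred))
loo_f1 = f1_score(y, loo_pred, average='weighted')

print("\n🧪 Leave-One-Out Cross-Validation (LOOCV) Results:")
print(f"Accuracy: {loo_accuracy:.4f}")
print(f"RMSE: {loo_rmse:.4f}")
print(f"F1-score: {loo_f1:.4f}")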



**📝 Bias-Variance Tradeoff:**

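Holdout: Simple and fast, but the estimate depends on a single random split, so it has high variance. Training on only 80% of the data also makes it slightly pessimistic.

K-Fold: Every row is used for testing exactly once and the scores are averaged over k folds, which gives a balanced bias-variance tradeoff at moderate cost. Preferred for most projects.

LOOCV: Low bias because each model trains on all but one row, but the estimate can have high variance, and the method needs one model fit per row, which makes it computationally expensive.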


**✅ Recommendation:** Use K-Fold Cross-Validation (k=5) for the best balance of performance and efficiency.
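
To make the cost side of this comparison concrete, the sketch below counts how many model fits each strategy needs. The row count of 150 is an arbitrary assumption, not the size of the ODI dataset, and `ShuffleSplit` with one split stands in for the single holdout split.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, ShuffleSplit

n_rows = 150  # hypothetical dataset size
X_dummy = np.zeros((n_rows, 1))

splitters = {
    "Holdout (80-20)": ShuffleSplit(n_splits=1, test_size=0.2, random_state=42),
    "K-Fold (k=5)": KFold(n_splits=5, shuffle=True, random_state=42),
    "LOOCV": LeaveOneOut(),
}

# get_n_splits returns the number of train/test iterations, i.e. model fits.
n_fits = {name: cv.get_n_splits(X_dummy) for name, cv in splitters.items()}
for name, count in n_fits.items():
    print(f"{name}: {count} fit(s)")
```

The LOOCV count grows linearly with the number of rows, which is why it is rarely practical on large datasets.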

**----------------------------TASK 9------------------------------**

**Step 1: Import Libraries and Load Data**

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

github_url = "https://github.com/alihaider-debug/Cricketdataanalysis/raw/main/ODI_Match_Data.csv"
df = pd.read_csv(github_url)

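# Display basic info
# Note: 'target' is assumed to be the name of the class-label column;
# replace it with the actual label column in your data.
print(df.info())
print(df['target'].value_counts())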


**📊 Step 2: Perform Stratified Sampling**

In [None]:
# Perform stratified split (80% train, 20% test) based on target class
train_set, test_set = train_test_split(df, test_size=0.2, stratify=df['target'], random_state=42)

# Verify distribution in train and test sets
print("Train set distribution:\n", train_set['target'].value_counts(normalize=True))
print("\nTest set distribution:\n", test_set['target'].value_counts(normalize=True))


**📈 Step 3: Marginal Probability Analysis**

In [None]:
# Calculate marginal probabilities in the original, train, and test sets
original_probs = df['target'].value_counts(normalize=True)
train_probs = train_set['target'].value_counts(normalize=True)
test_probs = test_set['target'].value_counts(normalize=True)

# Display results
print("\nMarginal Probabilities Comparison:")
print("Original Distribution:\n", original_probs)
print("Train Distribution:\n", train_probs)
print("Test Distribution:\n", test_probs)


**📝 Step 4: Conclusion**

✅ Stratified sampling keeps the class proportions in the train and test sets approximately equal to those of the original dataset (exact equality is limited only by rounding to whole rows).

✅ Marginal probabilities are consistent across sets, making the model evaluation more reliable.
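
The claim can be checked numerically. The sketch below builds a synthetic, imbalanced label column (90% class 0, 10% class 1, an assumption made for illustration) and measures the largest gap between each split's class shares and the original shares.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data; not taken from the ODI dataset.
demo = pd.DataFrame({"feature": np.arange(1000), "target": [0] * 900 + [1] * 100})

strat_train, strat_test = train_test_split(
    demo, test_size=0.2, stratify=demo["target"], random_state=42
)

orig_p = demo["target"].value_counts(normalize=True).sort_index()
train_p = strat_train["target"].value_counts(normalize=True).sort_index()
test_p = strat_test["target"].value_counts(normalize=True).sort_index()

# Largest gap between a split's class share and the original share.
max_gap = max((train_p - orig_p).abs().max(), (test_p - orig_p).abs().max())
print(f"Largest deviation from original class shares: {max_gap:.4f}")
```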

**-----------------------TASK 10-------------------------**

**Step 1: Import Libraries and Load Data**

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

# Load dataset (replace 'your_dataset.csv' with your file)
df = pd.read_csv('your_dataset.csv')

# View first rows
df.head()


**Part 1: Handling Categorical Attributes**

**1. Check for Missing Values and Fill Them**

In [None]:
# Check for missing values in categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns
print("Missing values in categorical columns:\n", df[categorical_cols].isnull().sum())

# Fill missing values with mode
for col in categorical_cols:
    # Assign the result back; in-place fillna on a column selection is deprecated in recent pandas
    df[col] = df[col].fillna(df[col].mode()[0])


**2. Encode Categorical Features:**

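Ordinal (order-preserving) encoding for ordinal features.

One-Hot Encoding for nominal features.

In [None]:
# Ordinal encoding for ordered features (e.g., Education Level: High School < Bachelor < Master).
# LabelEncoder assigns codes alphabetically (Bachelor=0, High School=1, Master=2),
# so an explicit mapping is used here to preserve the intended order.
# The label strings below are assumed from the example; adjust them to the values in your data.
education_order = {'High School': 0, 'Bachelor': 1, 'Master': 2}
df['education_level'] = df['education_level'].map(education_order)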

# One-Hot Encoding for nominal features (e.g., Gender, Region)
df = pd.get_dummies(df, columns=['gender', 'region'], drop_first=True)
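
`LabelEncoder` assigns integer codes in sorted (alphabetical) order, which does not generally match a feature's natural ranking. The sketch below shows the difference on the education levels named in the comment above, using scikit-learn's `OrdinalEncoder` with an explicit category order. The three label strings are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

levels = pd.DataFrame({"education_level": ["Master", "High School", "Bachelor"]})

# Alphabetical codes: Bachelor=0, High School=1, Master=2
alphabetical = LabelEncoder().fit_transform(levels["education_level"]).tolist()

# Explicit order: High School=0, Bachelor=1, Master=2
encoder = OrdinalEncoder(categories=[["High School", "Bachelor", "Master"]])
ordered = (
    encoder.fit_transform(levels[["education_level"]]).ravel().astype(int).tolist()
)

print("LabelEncoder:  ", alphabetical)
print("OrdinalEncoder:", ordered)
```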


**Part 2: Handling Text Attributes**

**1. Text Preprocessing (Cleaning)**

In [None]:
import re

# Function to clean text
def clean_text(text):
    text = re.sub(r'\W', ' ', text)  # Remove special characters
    text = re.sub(r'\s+', ' ', text)  # Remove extra spaces
    text = text.lower().strip()       # Convert to lowercase
    return text

# Clean 'review' text column (missing reviews become empty strings so re.sub does not fail on NaN)
df['review'] = df['review'].fillna('').astype(str).apply(clean_text)
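
A quick check of what the cleaning function does to a single string. The sample sentence is made up, and the function is repeated here so the snippet runs on its own.

```python
import re

def clean_text(text):
    text = re.sub(r'\W', ' ', text)   # replace non-word characters with spaces
    text = re.sub(r'\s+', ' ', text)  # collapse runs of whitespace
    return text.lower().strip()       # lowercase and trim

sample = "Great  match!!! Loved it :)"
cleaned = clean_text(sample)
print(cleaned)  # → great match loved it
```

Note that `\W` treats the underscore as a word character, so a token such as `a_b` passes through unchanged.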


**2. Convert Text to Numerical Format (TF-IDF Vectorization)**

In [None]:
# TF-IDF Vectorization
tfidf = TfidfVectorizer(max_features=1000, stop_words='english')
tfidf_matrix = tfidf.fit_transform(df['review'])

# Convert TF-IDF matrix to DataFrame and concatenate
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf.get_feature_names_out())
df = pd.concat([df.reset_index(drop=True), tfidf_df.reset_index(drop=True)], axis=1)

# Drop the original text column
df.drop(columns=['review'], inplace=True)


**Part 3: Scaling Numerical Features**

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Select numerical columns
numerical_cols = df.select_dtypes(include=['float64', 'int64']).columns

# Apply Standardization
scaler = StandardScaler()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])

# Optional: Apply MinMax Scaling
# minmax_scaler = MinMaxScaler()
# df[numerical_cols] = minmax_scaler.fit_transform(df[numerical_cols])
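
The two scalers differ in what they guarantee: standardization yields zero mean and unit variance, while min-max scaling maps the values onto the range [0, 1]. A minimal sketch on four invented numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

values = np.array([[10.0], [20.0], [30.0], [40.0]])

standardized = StandardScaler().fit_transform(values)  # zero mean, unit variance
min_maxed = MinMaxScaler().fit_transform(values)        # rescaled to [0, 1]

print("Standardized:", standardized.ravel().round(3))
print("Min-max:     ", min_maxed.ravel().round(3))
```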


**Part 4: Create New Features (If Applicable)**

In [None]:
# Example: Create a new feature combining age and income
df['income_per_age'] = df['income'] / (df['age'] + 1)  # Avoid division by zero
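
A quick check of the ratio on invented values. The `income` and `age` columns are the placeholder names used above. The `+ 1` offset keeps the denominator non-zero when `age` is 0, although it would not protect against an `age` of -1.

```python
import pandas as pd

# Made-up rows, including an age of 0 to exercise the offset.
people = pd.DataFrame({"income": [30000, 52000, 0], "age": [29, 51, 0]})

people["income_per_age"] = people["income"] / (people["age"] + 1)
ratios = people["income_per_age"].tolist()
print(ratios)  # → [1000.0, 1000.0, 0.0]
```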
