Question 1: -

Estimate Rainfall from Temperature and Humidity using Linear Regression {Use weather dataset} https://www.kaggle.com/datasets/zaraavagyan/weathercsv

* Objective: Use linear regression to estimate Rainfall based on MinTemp, MaxTemp, and Humidity values.
* Target Variable: Rainfall
* Features: MinTemp, MaxTemp, Humidity9am, Humidity3pm
* Use Case: Helps in understanding rainfall triggers and irrigation planning.
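
Written out, the model for this objective has the form below. The symbols β0 to β4 are notation introduced here for the intercept and coefficients, not names from the dataset; ordinary least squares chooses them to minimize the sum of squared residuals over the n training rows.

```latex
\widehat{\text{Rainfall}} = \beta_0 + \beta_1\,\text{MinTemp} + \beta_2\,\text{MaxTemp}
    + \beta_3\,\text{Humidity9am} + \beta_4\,\text{Humidity3pm}

\min_{\beta_0,\dots,\beta_4} \; \sum_{i=1}^{n} \left( \text{Rainfall}_i - \widehat{\text{Rainfall}}_i \right)^2
```

When the features are standardized before fitting, each coefficient is the change in predicted Rainfall per one standard deviation of that feature.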

In [16]:
import pandas as pd

# Read the weather CSV file (ensure the file path is correct)
df = pd.read_csv("weather_report.csv")

# Display the first few rows and list out column names
print(df.head())
print("Columns in the dataset:", df.columns.tolist())

# Select only the columns of interest
features = ['MinTemp', 'MaxTemp', 'Humidity9am', 'Humidity3pm']
target = 'Rainfall'
df_subset = df[features + [target]]

# Check for missing values
print(df_subset.isnull().sum())

# Drop rows with missing values
df_clean = df_subset.dropna()

# Optional: reset index after dropping rows
df_clean = df_clean.reset_index(drop=True)

X = df_clean[features]
y = df_clean[target]

# Inspect the data shapes
print("Features shape:", X.shape)
print("Target shape:", y.shape)

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split the data (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

from sklearn.linear_model import LinearRegression

# Create and train the linear regression model
lin_reg = LinearRegression()
lin_reg.fit(X_train_scaled, y_train)

# Print the regression coefficients and intercept
print("Coefficients:", lin_reg.coef_)
print("Intercept:", lin_reg.intercept_)

from sklearn.metrics import mean_squared_error, r2_score

# Predict on the test set
y_pred = lin_reg.predict(X_test_scaled)

# Compute evaluation metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R-squared:", r2)


   MinTemp  MaxTemp  Rainfall  Evaporation  Sunshine WindGustDir  \
0      8.0     24.3       0.0          3.4       6.3          NW   
1     14.0     26.9       3.6          4.4       9.7         ENE   
2     13.7     23.4       3.6          5.8       3.3          NW   
3     13.3     15.5      39.8          7.2       9.1          NW   
4      7.6     16.1       2.8          5.6      10.6         SSE   

   WindGustSpeed WindDir9am WindDir3pm  WindSpeed9am  ...  Humidity3pm  \
0           30.0         SW         NW           6.0  ...           29   
1           39.0          E          W           4.0  ...           36   
2           85.0          N        NNE           6.0  ...           69   
3           54.0        WNW          W          30.0  ...           56   
4           50.0        SSE        ESE          20.0  ...           49   

   Pressure9am  Pressure3pm  Cloud9am  Cloud3pm  Temp9am  Temp3pm  RainToday  \
0       1019.7       1015.0         7         7     14.4     23.6 
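
An R-squared value is easier to read next to a baseline. The sketch below is one possible way to do that, not part of the original run: it uses synthetic stand-in data (not the weather dataset), wraps the scaler and the regression in a pipeline, and compares cross-validated R-squared against a model that always predicts the mean. The names `X_demo`, `y_demo`, `model_r2` and `baseline_r2` are made up for this illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the weather dataset): 200 rows, 4 features.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 4))
# Assumed for illustration: the target is linear in the features plus noise.
y_demo = X_demo @ np.array([1.5, -2.0, 0.5, 3.0]) + rng.normal(scale=0.5, size=200)

# The pipeline refits the scaler inside every CV fold, so no statistics
# from the held-out fold leak into training.
model = make_pipeline(StandardScaler(), LinearRegression())
model_r2 = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")

# Baseline that always predicts the training-fold mean of the target.
baseline = DummyRegressor(strategy="mean")
baseline_r2 = cross_val_score(baseline, X_demo, y_demo, cv=5, scoring="r2")

print("Linear regression R^2 per fold:", model_r2.round(3))
print("Mean baseline R^2 per fold:", baseline_r2.round(3))
```

A constant prediction can never score above zero on R-squared, so the gap between the two rows shows how much the features actually contribute.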

Question 2: -

Classify Days as Hot or Not using Support vector machine (SVM) {Use weather dataset}
https://www.kaggle.com/datasets/zaraavagyan/weathercsv

* Objective: Classify whether a day is hot (MaxTemp > 30°C) or not using features like MinTemp, Humidity, and Rainfall.
* Target Variable: Binary class: 1 = Hot Day, 0 = Normal/Cold Day
* Features: MinTemp, Humidity9am, Humidity3pm, Rainfall
* Use Case: Useful for issuing heatwave warnings in summer.
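
With a fixed threshold, hot days may well be the minority class; that is an assumption here, not something checked against the dataset. A minimal sketch on synthetic temperatures (not real observations) of applying the labelling rule and keeping the class ratio similar in both partitions with `stratify`:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the MaxTemp column (NOT real observations).
rng = np.random.default_rng(0)
demo = pd.DataFrame({"MaxTemp": rng.normal(loc=22.0, scale=6.0, size=300)})

# Same labelling rule as the objective: hot when MaxTemp exceeds 30 °C.
demo["target"] = (demo["MaxTemp"] > 30).astype(int)
print(demo["target"].value_counts(normalize=True))

# stratify keeps the hot / not-hot ratio similar in train and test.
train_df, test_df = train_test_split(
    demo, test_size=0.2, random_state=42, stratify=demo["target"]
)
print("Hot share in train:", train_df["target"].mean())
print("Hot share in test:", test_df["target"].mean())
```

MaxTemp defines the label, so it has to stay out of the feature list, as in the feature selection for this question.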

In [10]:
import pandas as pd

# Load the dataset
df = pd.read_csv("weather_report.csv")

# Display a sample of the data and list available columns
print(df.head())
print("Columns in the dataset:", df.columns.tolist())

# Create a binary target variable: 1 if MaxTemp > 30°C, else 0
df['target'] = (df['MaxTemp'] > 30).astype(int)

# Select features according to the use case
features = ['MinTemp', 'Humidity9am', 'Humidity3pm', 'Rainfall']
X = df[features]
y = df['target']

# Preview the target distribution and selected features
print(X.head())
print(y.value_counts())

# Check for missing values
print(X.isnull().sum())

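# Drop rows with missing feature values (or consider imputation).
# MaxTemp is included because a missing MaxTemp would otherwise be
# silently labelled 0 by the comparison above.
df_clean = df.dropna(subset=features + ['MaxTemp'])
X_clean = df_clean[features]
y_clean = df_clean['target']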

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split into training and testing sets – using 80% for training and 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X_clean, y_clean, test_size=0.2, random_state=42)

# Scale the features using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Set up the parameter grid - trying both a linear and a radial basis function (RBF) kernel
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']  # applicable especially for the RBF kernel
}

# Use GridSearchCV for hyperparameter tuning using 5-fold cross-validation
grid_search = GridSearchCV(SVC(random_state=42), param_grid, cv=5)
grid_search.fit(X_train_scaled, y_train)
print("Best parameters found:", grid_search.best_params_)

# GridSearchCV refits the best parameter combination on the whole training
# set by default (refit=True), so best_estimator_ is already trained
svm_model = grid_search.best_estimator_

from sklearn.metrics import classification_report, confusion_matrix

# Predict the target on the test set
y_pred = svm_model.predict(X_test_scaled)

# Evaluate the performance
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

   MinTemp  MaxTemp  Rainfall  Evaporation  Sunshine WindGustDir  \
0      8.0     24.3       0.0          3.4       6.3          NW   
1     14.0     26.9       3.6          4.4       9.7         ENE   
2     13.7     23.4       3.6          5.8       3.3          NW   
3     13.3     15.5      39.8          7.2       9.1          NW   
4      7.6     16.1       2.8          5.6      10.6         SSE   

   WindGustSpeed WindDir9am WindDir3pm  WindSpeed9am  ...  Humidity3pm  \
0           30.0         SW         NW           6.0  ...           29   
1           39.0          E          W           4.0  ...           36   
2           85.0          N        NNE           6.0  ...           69   
3           54.0        WNW          W          30.0  ...           56   
4           50.0        SSE        ESE          20.0  ...           49   

   Pressure9am  Pressure3pm  Cloud9am  Cloud3pm  Temp9am  Temp3pm  RainToday  \
0       1019.7       1015.0         7         7     14.4     23.6 
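
A possible variant of the grid search, sketched on synthetic data (not the weather dataset): the scaler sits inside a pipeline so it is refit in every fold, and `gamma` is searched only for the RBF kernel, where it has an effect. The step names `scale` and `svc` and the variables `X_demo`, `y_demo`, `pipe` and `search` are choices made for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the weather dataset): 4 features, 2 classes.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(random_state=42))])

# A list of grids: gamma is ignored by the linear kernel, so it is only
# combined with the RBF kernel.
param_grid = [
    {"svc__kernel": ["linear"], "svc__C": [0.1, 1, 10]},
    {"svc__kernel": ["rbf"], "svc__C": [0.1, 1, 10], "svc__gamma": ["scale", "auto"]},
]

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_tr, y_tr)

print("Candidates evaluated:", len(search.cv_results_["params"]))
print("Best parameters:", search.best_params_)
print("Test accuracy:", search.score(X_te, y_te))
```

This grid has 9 candidates instead of the 12 produced by crossing every kernel with every `gamma` value.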

Question 3: -

Classify States Based on Gender-Specific Cancer Risk using Support vector machine (SVM) {Use cancer dataset} https://corgis-edu.github.io/corgis/csv/cancer/

* Objective: Classify states as having a female-dominant or male-dominant cancer prevalence using gender-separated cancer rates.
* Target Variable: Binary class: 1 = Female rate > Male rate, 0 = otherwise
* Use Case: Gender-specific resource allocation for cancer awareness and treatment.
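
A binary target defined by a comparison can turn out to contain only one class, or very few members of one class. A rough guard, sketched on made-up rates (not taken from the dataset), is to count the classes before any cross-validation. The helper `max_stratified_folds` is hypothetical and written only for this illustration.

```python
import pandas as pd

# Illustrative per-state rates (made-up numbers, NOT taken from the dataset).
demo = pd.DataFrame({
    "State": ["A", "B", "C", "D", "E", "F"],
    "FemaleRate": [610.0, 650.0, 700.0, 590.0, 640.0, 720.0],
    "MaleRate": [900.0, 640.0, 880.0, 910.0, 630.0, 870.0],
})

# Same labelling rule as the target definition: 1 when the female rate is higher.
demo["target"] = (demo["FemaleRate"] > demo["MaleRate"]).astype(int)

class_counts = demo["target"].value_counts()
print(class_counts)


def max_stratified_folds(labels: pd.Series) -> int:
    """Return the size of the smallest class, or 0 if only one class exists.

    With more folds than this, some validation folds cannot contain
    both classes.
    """
    counts = labels.value_counts()
    if len(counts) < 2:
        return 0
    return int(counts.min())


n_folds = max_stratified_folds(demo["target"])
print("Upper bound on stratified folds:", n_folds)
```

A result of 0 or 1 means a classifier cannot be meaningfully cross-validated on that target; a different rate column or target definition would be needed.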

In [5]:
import pandas as pd

# Load the cancer dataset from a local CSV file (adjust the path if needed)
df = pd.read_csv("cancer_report.csv")

# Take a quick look at the data
print(df.head())
print(df.columns)


# Aggregate data by state (if the dataset has multiple rows per state),
# using the over-64 age group as the gender-separated rate
df_state = df.groupby('State').agg({
    'Rates.Age and Sex.Male.> 64': 'mean',
    'Rates.Age and Sex.Female.> 64': 'mean'
}).reset_index()

# Rename columns for simplicity
df_state.rename(columns={
    'Rates.Age and Sex.Male.> 64': 'MaleRate',
    'Rates.Age and Sex.Female.> 64': 'FemaleRate'
}, inplace=True)

# Create the binary target: 1 if female rate > male rate, else 0
df_state['target'] = (df_state['FemaleRate'] > df_state['MaleRate']).astype(int)

# Optionally, create an additional feature: rate difference
df_state['rate_diff'] = df_state['FemaleRate'] - df_state['MaleRate']

print(df_state.head())


from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Choose features – in this case, we can use both rate columns.
features = ['FemaleRate', 'MaleRate']
X = df_state[features]
y = df_state['target']

# Split the data: 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features so that they are centered and have unit variance
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)


from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Define a parameter grid for hyperparameter tuning.
param_grid = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'rbf']
}

# Ensure the target variable has at least two classes
if y.nunique() < 2:
    print("The target variable has fewer than two classes. Adding one synthetic row to create a second class.")
    # Caution: this row is fabricated, not observed data. With a single
    # sample in the new class, some cross-validation training folds still
    # contain only one class, so part of the grid-search fits can fail.
    synthetic_data = pd.DataFrame({
        'FemaleRate': [df_state['FemaleRate'].max() + 10],
        'MaleRate': [df_state['MaleRate'].min() - 10],
        'target': [1]  # the otherwise missing class
    })
    df_state = pd.concat([df_state, synthetic_data], ignore_index=True)
    X = df_state[features]
    y = df_state['target']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

# Perform grid search with 5-fold cross-validation to find the best parameters
grid_search = GridSearchCV(SVC(random_state=42), param_grid, cv=5)
grid_search.fit(X_train_scaled, y_train)
print("Best parameters: ", grid_search.best_params_)

# GridSearchCV refits the best parameter combination on the whole training
# set by default (refit=True), so best_estimator_ is already trained
svm_model = grid_search.best_estimator_


from sklearn.metrics import classification_report, confusion_matrix

# Predict on the test set
y_pred = svm_model.predict(X_test_scaled)

# Evaluate performance
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

        State  Total.Rate  Total.Number  Total.Population  Rates.Age.< 18  \
0     Alabama       214.2       71529.0        33387205.0             2.0   
1      Alaska       128.1        6361.0         4966180.0             1.7   
2     Arizona       165.6       74286.0        44845598.0             2.5   
3    Arkansas       223.9       45627.0        20382448.0             2.3   
4  California       150.9      393980.0       261135696.0             2.6   

   Rates.Age.18-45  Rates.Age.45-64  Rates.Age.> 64  \
0             18.5            244.7          1017.8   
1             11.8            170.9           965.2   
2             13.6            173.6           840.2   
3             17.6            250.1          1048.3   
4             13.7            163.7           902.4   

   Rates.Age and Sex.Female.< 18  Rates.Age and Sex.Male.< 18  ...  \
0                            2.0                          2.1  ...   
1                            0.0                          0.0  ...

8 fits failed out of a total of 40.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
8 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\SEVAK\AppData\Local\Programs\Python\Python313\Lib\site-packages\sklearn\model_selection\_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\SEVAK\AppData\Local\Programs\Python\Python313\Lib\site-packages\sklearn\base.py", line 1389, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "c:\Users\SEVAK\AppData\Local\Programs\Python\Python313\Lib\site-packages\sklearn\svm\_base.py", line 207, in fit
    y = self._validate_targets(y)
  File "c:\U