# __Logistic Regression__

### __Theoretical__

### 1. What is Logistic Regression, and how does it differ from Linear Regression?

Logistic Regression is a statistical method used for binary classification tasks. Unlike Linear Regression, which predicts continuous outcomes, Logistic Regression predicts probabilities and classifies outcomes into categories using a sigmoid function.

### 2. What is the mathematical equation of Logistic Regression?

The equation is:

\[ P(y=1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n)}} \]

### 3. Why do we use the Sigmoid function in Logistic Regression?


The Sigmoid function maps any real value into a probability between 0 and 1, making it ideal for binary classification.
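
To make the mapping concrete, here is a minimal sketch in plain Python (standard library only). The function name `sigmoid` and the coefficient values `b0` and `b1` are illustrative assumptions, not values fitted on any dataset in this notebook.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real number z to a probability between 0 and 1."""
    # For negative z, use an algebraically equivalent form so exp() cannot overflow
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)

# Hypothetical coefficients: intercept b0 and one slope b1
b0, b1 = -1.0, 2.0
x = 0.5
z = b0 + b1 * x   # linear combination, i.e. the log odds
p = sigmoid(z)    # predicted probability P(y=1 | x)

print("z =", z, "P(y=1|x) =", p)
print(sigmoid(-50.0), sigmoid(50.0))  # very close to 0 and to 1
```

A probability at or above 0.5 (equivalently, z at or above 0) is usually mapped to class 1, although the threshold can be moved when the two kinds of error have different costs.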

### 4. What is the cost function of Logistic Regression?


The cost function for logistic regression is defined as follows:

\[ J(\theta) = - \frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] \]
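
The formula can be checked on a few numbers. The following is a small sketch in plain Python; the helper name `log_loss`, the clipping constant `eps`, and the sample labels and probabilities are assumptions made for illustration.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Average binary cross-entropy J over m examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip p away from 0 and 1 so log() never receives 0
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

labels = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.8, 0.2]  # high probability on the true class
confident_wrong = [0.1, 0.9, 0.2, 0.8]  # high probability on the wrong class

loss_right = log_loss(labels, confident_right)
loss_wrong = log_loss(labels, confident_wrong)
print("Loss when confident and right:", loss_right)
print("Loss when confident and wrong:", loss_wrong)
```

The second loss is much larger than the first, which is the point of the logarithm: confident mistakes are penalized heavily.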


### 5. What is Regularization in Logistic Regression? Why is it needed?

Regularization prevents overfitting by penalizing large coefficients. It helps improve the generalization of the model.

### 6. Explain the difference between Lasso, Ridge, and Elastic Net regression.

- Lasso (L1 Regularization): Shrinks some coefficients to zero for feature selection.

- Ridge (L2 Regularization): Shrinks coefficients but doesn’t reduce them to zero.

- Elastic Net: A combination of L1 and L2 regularization.
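
The contrast between L1 and L2 can be seen by counting coefficients that end up exactly zero. This is a sketch on synthetic data from `make_classification`; the dataset, the value `C=0.05`, and the variable names are assumptions chosen for illustration, and the exact counts depend on them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption): 20 features, only 4 of them informative
X, y = make_classification(n_samples=500, n_features=20, n_informative=4,
                           n_redundant=0, random_state=0)
X = StandardScaler().fit_transform(X)

# Same regularization strength, different penalty type
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.05).fit(X, y)
l2_model = LogisticRegression(penalty='l2', solver='liblinear', C=0.05).fit(X, y)

n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))
print("Coefficients set to zero by L1:", n_zero_l1)
print("Coefficients set to zero by L2:", n_zero_l2)
```

L1 is expected to zero out several of the uninformative features, while L2 keeps every coefficient small but non-zero.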

### 7. When should we use Elastic Net instead of Lasso or Ridge?

Use Elastic Net when there are multiple correlated features, combining Lasso's feature selection and Ridge's shrinkage.

### 8. What is the impact of the regularization parameter (λ) in Logistic Regression?

A larger λ increases regularization strength, shrinking the coefficients toward zero and reducing overfitting; an overly large λ causes underfitting.
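
In scikit-learn's LogisticRegression the strength is set through C, which is the inverse of λ, so a smaller C means stronger regularization. The sketch below fits the same synthetic data (an assumption, generated with `make_classification`) at three values of C and reports the size of the coefficient vector.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

norms = {}
for C in [0.01, 1.0, 100.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C = {C} (lambda = {1 / C}): ||coef|| = {norms[C]:.3f}")
```

The coefficient norm should grow as C increases, that is, as λ decreases.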

### 9. What are the key assumptions of Logistic Regression?

- Binary outcome variable.

- Little or no multicollinearity among features.

- Large sample size.

- Linearity between independent variables and log odds.

### 10. What are some alternatives to Logistic Regression for classification tasks?

- Decision Trees

- Random Forests

- Support Vector Machines (SVM)

- K-Nearest Neighbors (KNN)

- Neural Networks

### 11. What are Classification Evaluation Metrics?

- Accuracy

- Precision

- Recall

- F1-Score

- ROC-AUC Score
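
The first four metrics follow directly from the counts in a confusion matrix. Here is a short sketch with hypothetical counts (the numbers are made up for illustration); ROC-AUC is left out because it needs predicted scores rather than counts.

```python
# Hypothetical confusion-matrix counts for a binary classifier
tp, fp, fn, tn = 40, 10, 20, 30

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # of predicted positives, how many are correct
recall = tp / (tp + fn)                     # of actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
```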

### 12. How does class imbalance affect Logistic Regression?

It can lead to biased predictions towards the majority class. Using class weights or resampling techniques helps mitigate this.
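
As an illustration of class weights, the sketch below reproduces the formula scikit-learn documents for `class_weight='balanced'`, namely n_samples / (n_classes * class_count). The helper name `balanced_class_weights` and the 90/10 label split are assumptions for this example.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical imbalanced labels: 90 negatives and 10 positives
y = [0] * 90 + [1] * 10
weights = balanced_class_weights(y)
print(weights)
```

The minority class receives the larger weight, so its misclassifications contribute more to the loss.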

### 13. What is Hyperparameter Tuning in Logistic Regression?

Hyperparameter tuning is the search for the settings that are fixed before training, such as the regularization strength (C), the penalty type, and the solver, that give the best cross-validated score. Grid search and randomized search are the usual tools.


### 14. What are different solvers in Logistic Regression? Which one should be used?

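- liblinear: Good for small datasets; supports L1 and L2 penalties.

- saga: Supports L1, L2, and Elastic Net regularization; a good choice for large datasets, provided the features are scaled.

- lbfgs: The scikit-learn default; supports L2 or no penalty and is a reasonable first choice.

- newton-cg: Supports L2 or no penalty; suitable for multinomial loss.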

### 15. How is Logistic Regression extended for multiclass classification?

It is extended with One-vs-Rest (OvR), which trains one binary classifier per class, or with Softmax (multinomial) Regression, which models all classes jointly.

### 16. What are the advantages and disadvantages of Logistic Regression?

- Advantages: Simple, interpretable, fast.

- Disadvantages: Limited to linear decision boundaries, sensitive to outliers.

### 17. What are some use cases of Logistic Regression?

- Spam detection

- Fraud detection

- Customer churn prediction

- Medical diagnosis

### 18. What is the difference between Softmax Regression and Logistic Regression?


Softmax handles multiclass classification, whereas Logistic Regression is typically used for binary classification.
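
The relationship between the two can be sketched in a few lines of plain Python. The function names and the sample scores are assumptions for illustration; the check at the end shows that softmax over two classes, with one score fixed at 0, gives the same probability as the sigmoid.

```python
import math

def softmax(scores):
    """Convert real-valued class scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() stable and does not change the result
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Logistic function used by binary Logistic Regression."""
    return 1.0 / (1.0 + math.exp(-z))

probs = softmax([2.0, 1.0, 0.1])   # three hypothetical class scores
z = 1.5
two_class = softmax([z, 0.0])      # two classes, second score fixed at 0

print("Three-class probabilities:", probs, "sum =", sum(probs))
print("Two-class softmax:", two_class[0], "sigmoid:", sigmoid(z))
```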

### 19. How do we choose between One-vs-Rest (OvR) and Softmax for multiclass classification?

- OvR: When classes are imbalanced.

- Softmax: When classes are mutually exclusive.

### 20. How do we interpret coefficients in Logistic Regression?

Each coefficient represents the change in the log odds of the outcome for a one-unit increase in the predictor variable, holding the other predictors constant.
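
Exponentiating a coefficient turns it into an odds ratio, which is often easier to read than a change in log odds. A minimal sketch, assuming a hypothetical coefficient of 0.7 and intercept of -1.0 (neither comes from a fitted model in this notebook):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Odds corresponding to probability p."""
    return p / (1.0 - p)

intercept = -1.0     # hypothetical
coefficient = 0.7    # hypothetical
odds_ratio = math.exp(coefficient)

# Raise the predictor by one unit and compare the odds before and after
p_before = sigmoid(intercept + coefficient * 2.0)
p_after = sigmoid(intercept + coefficient * 3.0)
observed_ratio = odds(p_after) / odds(p_before)

print("exp(coefficient):", odds_ratio)
print("odds after / odds before:", observed_ratio)
```

Both numbers agree: a one-unit increase multiplies the odds by exp(0.7), roughly 2, regardless of the starting value of the predictor.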

### __Practical__

### 1. Write a Python program that loads a dataset, splits it into training and testing sets, applies Logistic Regression, and prints the model accuracy.

In [1]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

# Load dataset
df = pd.read_csv('amazon.csv')  # Replace with your dataset file name

# Identify categorical columns
categorical_columns = df.select_dtypes(include=['object']).columns

# Convert categorical features to numerical using Label Encoding
label_encoder = LabelEncoder()
for column in categorical_columns:
    df[column] = label_encoder.fit_transform(df[column].astype(str))

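# Prepare features (X) and target (y)
# Note: 'discounted_price' is a price column with many distinct values, and LogisticRegression
# treats every distinct value as its own class; a categorical target is a better fit
X = df.drop('discounted_price', axis=1)  # Replace 'discounted_price' with your actual target column
y = df['discounted_price']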

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict and calculate accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)


Model Accuracy: 0.08532423208191127


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
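
The convergence warning above suggests scaling the data. One way to do that is to put a StandardScaler in front of the model with a pipeline, sketched here on synthetic data because the CSV is not available in this example; the dataset, the scale factors, and the variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the columns are put on very different scales on purpose
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X = X * [1, 10, 100, 1000, 1, 10, 100, 1000]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The scaler is fitted on the training split only, then reused on the test split
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_train, y_train)

scaled_accuracy = scaled_model.score(X_test, y_test)
n_iterations = int(scaled_model[-1].n_iter_[0])  # iterations used by the solver
print("Accuracy with scaling:", scaled_accuracy)
print("Solver iterations:", n_iterations)
```

Keeping the scaler inside the pipeline also prevents test-set statistics from leaking into training when the pipeline is cross-validated.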


### 2. Write a Python program to apply L1 regularization (Lasso) on a dataset using LogisticRegression(penalty='l1') and print the model accuracy

In [2]:
# Load dataset
df = pd.read_csv('amazon.csv')  # Replace with your actual dataset filename

# Convert categorical data into numerical values using Label Encoding
categorical_columns = df.select_dtypes(include=['object']).columns
label_encoder = LabelEncoder()
for column in categorical_columns:
    df[column] = label_encoder.fit_transform(df[column].astype(str))

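# Prepare features (X) and target (y)
# Note: the dropped column ('discounted_price') differs from the target ('actual_price'),
# so the target itself remains among the features in X. Drop 'actual_price' from X before
# reusing this cell; the accuracy printed below was produced with the code as shown.
X = df.drop('discounted_price', axis=1)
y = df['actual_price']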

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply Logistic Regression with L1 Regularization (Lasso)
lasso_model = LogisticRegression(penalty='l1', solver='liblinear', max_iter=1000)
lasso_model.fit(X_train, y_train)

# Predict and calculate accuracy
y_pred = lasso_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("L1 Regularization (Lasso) Model Accuracy:", accuracy)

L1 Regularization (Lasso) Model Accuracy: 0.17747440273037543


### 3. Write a Python program to train Logistic Regression with L2 regularization (Ridge) using LogisticRegression(penalty='l2'). Print model accuracy and coefficients

In [3]:
# Load dataset
df = pd.read_csv('amazon.csv')  # Replace with your actual dataset filename

# Convert categorical data into numerical values using Label Encoding
categorical_columns = df.select_dtypes(include=['object']).columns
label_encoder = LabelEncoder()
for column in categorical_columns:
    df[column] = label_encoder.fit_transform(df[column].astype(str))

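# Prepare features (X) and target (y)
# Note: only 'review_content' is dropped, so the target 'review_title' remains among the
# features in X. Drop it from X before reusing this cell; the output below was produced
# with the code as shown.
X = df.drop('review_content', axis=1)
y = df['review_title']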

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression model with L2 Regularization (Ridge)
ridge_model = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=1000)
ridge_model.fit(X_train, y_train)

# Predict and calculate accuracy
y_pred = ridge_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("L2 Regularization (Ridge) Model Accuracy:", accuracy)

# Print model coefficients (coef_[0] is the row for the first class only when the target has several classes)
coefficients = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': ridge_model.coef_[0]
})
print("\nModel Coefficients:\n", coefficients)

L2 Regularization (Ridge) Model Accuracy: 0.20136518771331058

Model Coefficients:
                 Feature  Coefficient
0            product_id     0.002690
1          product_name    -0.016113
2              category     0.001144
3      discounted_price    -0.000947
4          actual_price     0.005354
5   discount_percentage     0.000738
6                rating     0.000010
7          rating_count     0.006789
8         about_product     0.014775
9               user_id    -0.007825
10            user_name     0.015989
11            review_id    -0.003824
12         review_title    -0.016667
13             img_link     0.023446
14         product_link    -0.018992


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### 4. Write a Python program to train Logistic Regression with Elastic Net Regularization (penalty='elasticnet')

In [4]:
# Load dataset
df = pd.read_csv('amazon.csv')  # Replace with your actual dataset filename

# Convert categorical data into numerical values using Label Encoding
categorical_columns = df.select_dtypes(include=['object']).columns
label_encoder = LabelEncoder()
for column in categorical_columns:
    df[column] = label_encoder.fit_transform(df[column].astype(str))

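# Prepare features (X) and target (y)
# Note: only 'review_content' is dropped, so the target 'review_title' remains among the
# features in X. Drop it from X before reusing this cell; the accuracy below was produced
# with the code as shown.
X = df.drop('review_content', axis=1)
y = df['review_title']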

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression model with Elastic Net Regularization
elastic_net_model = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5, max_iter=1000)
elastic_net_model.fit(X_train, y_train)

# Predict and calculate accuracy
y_pred = elastic_net_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Elastic Net Regularization Model Accuracy:", accuracy)

Elastic Net Regularization Model Accuracy: 0.1945392491467577




### 5. Write a Python program to train a Logistic Regression model for multiclass classification using multi_class='ovr'

In [5]:
# Load dataset
df = pd.read_csv('amazon.csv')  # Replace with your actual dataset filename

# Convert categorical data into numerical values using Label Encoding
categorical_columns = df.select_dtypes(include=['object']).columns
label_encoder = LabelEncoder()
for column in categorical_columns:
    df[column] = label_encoder.fit_transform(df[column].astype(str))

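# Prepare features (X) and target (y)
# Note: only 'review_content' is dropped, so the target 'review_title' remains among the
# features in X. Drop it from X before reusing this cell; the accuracy below was produced
# with the code as shown.
X = df.drop('review_content', axis=1)
y = df['review_title']

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression model for multiclass classification using One-vs-Rest (OvR)
# Note: the multi_class argument is deprecated in recent scikit-learn releases;
# wrapping the model in sklearn.multiclass.OneVsRestClassifier is the forward-compatible alternative
multiclass_model = LogisticRegression(multi_class='ovr', solver='liblinear', max_iter=1000)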
multiclass_model.fit(X_train, y_train)

# Predict and calculate accuracy
y_pred = multiclass_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Multiclass Classification (OvR) Model Accuracy:", accuracy)

Multiclass Classification (OvR) Model Accuracy: 0.16382252559726962


### 6. Write a Python program to apply GridSearchCV to tune the hyperparameters (C and penalty) of Logistic Regression. Print the best parameters and accuracy

In [6]:
from sklearn.model_selection import GridSearchCV

# Load dataset
df = pd.read_csv('amazon.csv')  # Replace with your actual dataset filename

# Convert categorical data into numerical values using Label Encoding
categorical_columns = df.select_dtypes(include=['object']).columns
label_encoder = LabelEncoder()
for column in categorical_columns:
    df[column] = label_encoder.fit_transform(df[column].astype(str))

# Prepare features (X) and target (y)
X = df.drop(['review_content', 'review_title'], axis=1)  # Exclude the free-text column and the target from the features
y = df['review_title']  # Replace with your actual target column name

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define parameter grid for GridSearchCV
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],  # Inverse of regularization strength
    'penalty': ['l1', 'l2'],       # Regularization type
    'solver': ['liblinear']        # Solver that supports both L1 and L2
}

# Initialize and apply GridSearchCV
grid_search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Print best parameters and corresponding accuracy
print("Best Parameters:", grid_search.best_params_)
print("Best Cross-Validation Accuracy:", grid_search.best_score_)

# Evaluate the best model on the test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Test Set Accuracy with Best Parameters:", accuracy)

### 7. Write a Python program to evaluate Logistic Regression using Stratified K-Fold Cross-Validation. Print the average accuracy

In [None]:
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Load dataset
df = pd.read_csv('amazon.csv')  # Replace with your actual dataset filename

# Convert categorical data into numerical values using Label Encoding
categorical_columns = df.select_dtypes(include=['object']).columns
label_encoder = LabelEncoder()
for column in categorical_columns:
    df[column] = label_encoder.fit_transform(df[column].astype(str))

# Prepare features (X) and target (y)
X = df.drop(['review_content', 'review_title'], axis=1)  # Exclude the free-text column and the target from the features
y = df['review_title']  # Replace with your actual target column name

# Initialize Logistic Regression model
model = LogisticRegression(max_iter=1000)

# Define Stratified K-Fold Cross-Validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Perform cross-validation and calculate accuracy scores
scores = cross_val_score(model, X, y, cv=skf, scoring='accuracy')

# Print each fold's accuracy and the average accuracy
print("Accuracy for each fold:", scores)
print("Average Accuracy across folds:", scores.mean())

### 8. Write a Python program to load a dataset from a CSV file, apply Logistic Regression, and evaluate its accuracy.

In [None]:
# Load dataset from CSV file
df = pd.read_csv('amazon.csv')  # Replace with your actual dataset filename

# Convert categorical data into numerical values using Label Encoding
categorical_columns = df.select_dtypes(include=['object']).columns
label_encoder = LabelEncoder()
for column in categorical_columns:
    df[column] = label_encoder.fit_transform(df[column].astype(str))

# Prepare features (X) and target (y)
X = df.drop(['review_content', 'review_title'], axis=1)  # Exclude the free-text column and the target from the features
y = df['review_title']  # Replace with your actual target column name

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict and calculate accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Logistic Regression Model Accuracy:", accuracy)

### 9. Write a Python program to apply RandomizedSearchCV for tuning hyperparameters (C, penalty, solver) in Logistic Regression. Print the best parameters and accuracy

In [10]:
import numpy as np
from sklearn.model_selection import RandomizedSearchCV

df = pd.read_csv('amazon.csv')  # Replace with your actual dataset filename

# Convert categorical data into numerical values using Label Encoding
categorical_columns = df.select_dtypes(include=['object']).columns
label_encoder = LabelEncoder()
for column in categorical_columns:
    df[column] = label_encoder.fit_transform(df[column].astype(str))

# Prepare features (X) and target (y)
X = df.drop(['review_content', 'review_title'], axis=1)  # Exclude the free-text column and the target from the features
y = df['review_title']  # Replace with your actual target column name

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define parameter distributions for RandomizedSearchCV.
# Each dictionary pairs solvers only with the penalties they support, so no sampled
# combination is rejected by LogisticRegression. The string 'none' is left out because
# newer scikit-learn releases no longer accept it as a penalty.
C_values = np.logspace(-3, 2, 10)  # Inverse regularization strength from 0.001 to 100
param_distributions = [
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2'], 'C': C_values},
    {'solver': ['lbfgs', 'newton-cg', 'sag'], 'penalty': ['l2'], 'C': C_values},
    {'solver': ['saga'], 'penalty': ['l1', 'l2'], 'C': C_values},
    {'solver': ['saga'], 'penalty': ['elasticnet'], 'l1_ratio': [0.25, 0.5, 0.75], 'C': C_values},
]

# Initialize Logistic Regression model
model = LogisticRegression(max_iter=1000)

# Apply RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_distributions,
    n_iter=20,  # Number of parameter settings sampled
    cv=5,       # 5-fold cross-validation
    scoring='accuracy',
    random_state=42
)

# Fit the model using random search
random_search.fit(X_train, y_train)

# Print best parameters and best cross-validation accuracy
print("Best Hyperparameters:", random_search.best_params_)
print("Best Cross-Validation Accuracy:", random_search.best_score_)

# Evaluate the best model on the test set
best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Test Set Accuracy with Best Parameters:", accuracy)

### 10. Write a Python program to implement One-vs-One (OvO) Multiclass Logistic Regression and print accuracy

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

# Load dataset from CSV file
df = pd.read_csv('amazon.csv')  # Replace with your actual dataset filename

# Convert categorical data into numerical values using Label Encoding
categorical_columns = df.select_dtypes(include=['object']).columns
label_encoder = LabelEncoder()
for column in categorical_columns:
    df[column] = label_encoder.fit_transform(df[column].astype(str))

# Prepare features (X) and target (y)
X = df.drop(['review_content', 'review_title'], axis=1)  # Exclude the free-text column and the target from the features
y = df['review_title']  # Replace with your actual target column name

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression model using One-vs-One strategy
ovo_model = OneVsOneClassifier(LogisticRegression(max_iter=1000))
ovo_model.fit(X_train, y_train)

# Predict and calculate accuracy
y_pred = ovo_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("One-vs-One (OvO) Multiclass Logistic Regression Accuracy:", accuracy)


### 11. Write a Python program to train a Logistic Regression model and visualize the confusion matrix for binary classification

In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.preprocessing import LabelEncoder

# Load dataset from CSV file
df = pd.read_csv('amazon.csv')  # Replace with your actual dataset filename

# Convert categorical data into numerical values using Label Encoding
categorical_columns = df.select_dtypes(include=['object']).columns
label_encoder = LabelEncoder()
for column in categorical_columns:
    df[column] = label_encoder.fit_transform(df[column].astype(str))

# Prepare features (X) and target (y)
X = df.drop(['review_content', 'review_title'], axis=1)  # Exclude the free-text column and the target from the features
y = df['review_title']  # Replace with your actual target column name

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict on test data
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Logistic Regression Model Accuracy:", accuracy)

# Generate and visualize confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

# Visualizing the confusion matrix with a heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.title('Confusion Matrix for Binary Classification')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

### 12. Write a Python program to train a Logistic Regression model and evaluate its performance using Precision, Recall, and F1-Score

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report
from sklearn.preprocessing import LabelEncoder

# Load dataset from CSV file
df = pd.read_csv('amazon.csv')  # Replace with your actual dataset filename

# Convert categorical data into numerical values using Label Encoding
categorical_columns = df.select_dtypes(include=['object']).columns
label_encoder = LabelEncoder()
for column in categorical_columns:
    df[column] = label_encoder.fit_transform(df[column].astype(str))

# Prepare features (X) and target (y)
X = df.drop(['review_content', 'review_title'], axis=1)  # Exclude the free-text column and the target from the features
y = df['review_title']  # Replace with your actual target column name

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict on the test data
y_pred = model.predict(X_test)

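# Calculate and print Precision, Recall, and F1-Score
# A label-encoded text column such as 'review_title' typically has more than two classes,
# and average='binary' raises an error in that case; 'weighted' (or 'macro') works for
# multiclass targets. Switch back to 'binary' only for a two-class target.
precision = precision_score(y_test, y_pred, average='weighted', zero_division=0)
recall = recall_score(y_test, y_pred, average='weighted', zero_division=0)
f1 = f1_score(y_test, y_pred, average='weighted', zero_division=0)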

print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)

# Print the full classification report
print("\nClassification Report:\n", classification_report(y_test, y_pred))

### 13. Write a Python program to train a Logistic Regression model on imbalanced data and apply class weights to improve model performance

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset from CSV file
df = pd.read_csv('amazon.csv')  # Replace with your actual dataset filename

# Convert categorical data into numerical values using Label Encoding
categorical_columns = df.select_dtypes(include=['object']).columns
label_encoder = LabelEncoder()
for column in categorical_columns:
    df[column] = label_encoder.fit_transform(df[column].astype(str))

# Prepare features (X) and target (y)
X = df.drop(['review_content', 'review_title'], axis=1)  # Exclude the free-text column and the target from the features
y = df['review_title']  # Replace with your actual target column name

# Check class distribution
print("Class distribution:\n", y.value_counts())

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression model with balanced class weights
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X_train, y_train)

# Predict on test data
y_pred = model.predict(X_test)

# Evaluate model performance
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Confusion Matrix Visualization
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix with Class Weights')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

### 14. Write a Python program to train Logistic Regression on the Titanic dataset, handle missing values, and evaluate performance

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.preprocessing import LabelEncoder
import seaborn as sns
import matplotlib.pyplot as plt

# Load Titanic dataset
df = pd.read_csv('titanic.csv')  # Replace with your actual dataset filename

# Display initial missing values count
print("Missing values before handling:\n", df.isnull().sum())

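# Handle missing values
# Fill missing Age values with the median age (plain assignment avoids pandas' chained-inplace warning)
df['Age'] = df['Age'].fillna(df['Age'].median())

# Fill missing Embarked values with the mode (most frequent value)
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Drop rows that still have missing values in the columns used by the model.
# A bare dropna() would also discard rows that are only missing an unused column
# (for example Cabin in the Kaggle Titanic file).
model_columns = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'Survived']
df.dropna(subset=model_columns, inplace=True)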

# Encode categorical variables using Label Encoding
label_encoder = LabelEncoder()
df['Sex'] = label_encoder.fit_transform(df['Sex'])  # Male = 1, Female = 0
df['Embarked'] = label_encoder.fit_transform(df['Embarked'])  # Encode Embarked

# Prepare features (X) and target (y)
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
X = df[features]
y = df['Survived']

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict on test data
y_pred = model.predict(X_test)

# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy on Titanic Dataset:", accuracy)
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Confusion Matrix Visualization
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix for Titanic Dataset')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

### 15. Write a Python program to apply feature scaling (Standardization) before training a Logistic Regression model. Evaluate its accuracy and compare results with and without scaling

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score

# Load dataset from CSV file
df = pd.read_csv('amazon.csv')  # Replace with your actual dataset filename

# Convert categorical data into numerical values using Label Encoding
categorical_columns = df.select_dtypes(include=['object']).columns
label_encoder = LabelEncoder()
for column in categorical_columns:
    df[column] = label_encoder.fit_transform(df[column].astype(str))

# Prepare features (X) and target (y)
X = df.drop(['review_content', 'review_title'], axis=1)  # Exclude the free-text column and the target from the features
y = df['review_title']  # Replace with your actual target column name

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 1️⃣ Logistic Regression without Feature Scaling
model_without_scaling = LogisticRegression(max_iter=1000)
model_without_scaling.fit(X_train, y_train)
y_pred_without_scaling = model_without_scaling.predict(X_test)
accuracy_without_scaling = accuracy_score(y_test, y_pred_without_scaling)
print("Accuracy without Feature Scaling:", accuracy_without_scaling)

# 2️⃣ Apply Feature Scaling (Standardization)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Logistic Regression with Feature Scaling
model_with_scaling = LogisticRegression(max_iter=1000)
model_with_scaling.fit(X_train_scaled, y_train)
y_pred_with_scaling = model_with_scaling.predict(X_test_scaled)
accuracy_with_scaling = accuracy_score(y_test, y_pred_with_scaling)
print("Accuracy with Feature Scaling:", accuracy_with_scaling)

# 3️⃣ Compare Results
if accuracy_with_scaling > accuracy_without_scaling:
    print("Feature scaling improved the model's accuracy.")
elif accuracy_with_scaling == accuracy_without_scaling:
    print("Feature scaling had no impact on the model's accuracy.")
else:
    print("Feature scaling decreased the model's accuracy.")

### 16. Write a Python program to train Logistic Regression and evaluate its performance using ROC-AUC score.

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve, accuracy_score
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt

# Load dataset from CSV file
df = pd.read_csv('amazon.csv')  # Replace with your actual dataset filename

# Convert categorical data into numerical values using Label Encoding
categorical_columns = df.select_dtypes(include=['object']).columns
label_encoder = LabelEncoder()
for column in categorical_columns:
    df[column] = label_encoder.fit_transform(df[column].astype(str))

# Prepare features (X) and target (y)
X = df.drop(['review_content', 'review_title'], axis=1)  # Exclude the free-text column and the target from the features
y = df['review_title']  # Replace with your actual target column name

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

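# Predict probabilities for ROC AUC calculation
# Note: column 1 is the positive class only when the target is binary (0/1);
# roc_auc_score and roc_curve as called below also expect a two-class target
y_prob = model.predict_proba(X_test)[:, 1]  # Probability estimates for the positive class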

# Calculate ROC-AUC score
roc_auc = roc_auc_score(y_test, y_prob)
print("ROC-AUC Score:", roc_auc)

# Plot ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'Logistic Regression (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random Guess')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()
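
The cell above takes a single probability column, which fits a two-class target. For a target with more than two classes, roc_auc_score accepts the full probability matrix together with multi_class='ovr'. The sketch below uses scikit-learn's built-in iris data as a stand-in (an assumption, chosen because it has three classes and needs no CSV file).

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Three-class stand-in dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Multiclass ROC-AUC needs one probability column per class
y_prob = model.predict_proba(X_test)
roc_auc_ovr = roc_auc_score(y_test, y_prob, multi_class='ovr')
print("One-vs-Rest ROC-AUC:", roc_auc_ovr)
```

The score is the average of the per-class one-vs-rest AUC values, so a single ROC curve no longer summarizes it; one curve per class would be needed for a plot.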

### 17. Write a Python program to train Logistic Regression using a custom regularization strength (C=0.5) and evaluate accuracy

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

# Load dataset from CSV file
df = pd.read_csv('amazon.csv')  # Replace with your actual dataset filename

# Convert categorical data into numerical values using Label Encoding
categorical_columns = df.select_dtypes(include=['object']).columns
label_encoder = LabelEncoder()
for column in categorical_columns:
    df[column] = label_encoder.fit_transform(df[column].astype(str))

# Prepare features (X) and target (y)
X = df.drop(['review_content', 'review_title'], axis=1)  # Exclude the free-text column and the target from the features
y = df['review_title']  # Replace with your actual target column name

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression model with custom regularization strength (C=0.5)
model = LogisticRegression(C=0.5, max_iter=1000)
model.fit(X_train, y_train)

# Predict on test data
y_pred = model.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Logistic Regression Model Accuracy with C=0.5:", accuracy)
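
In scikit-learn, `C` is the inverse of the regularization strength rather than a learning rate, so smaller values penalize the coefficients more heavily. The sketch below illustrates this on a synthetic dataset. `make_classification`, the chosen `C` values, and the names `X_demo`, `y_demo`, and `coef_norms` are assumptions for illustration and are not part of the exercise above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary dataset (an assumption for illustration; not the CSV used above)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

coef_norms = {}
for C in [0.01, 0.5, 100]:
    clf = LogisticRegression(C=C, max_iter=1000)
    clf.fit(X_demo, y_demo)
    # L2 norm of the learned coefficients: smaller C means stronger shrinkage
    coef_norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C={C}: coefficient norm = {coef_norms[C]:.3f}")
```

The coefficient norm should grow as `C` increases, which is the opposite direction from the λ convention used in the theory section.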

### 18. Write a Python program to train Logistic Regression and identify important features based on model coefficients

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt

# Load dataset from CSV file
df = pd.read_csv('amazon.csv')  # Replace with your actual dataset filename

# Convert categorical data into numerical values using Label Encoding
categorical_columns = df.select_dtypes(include=['object']).columns
label_encoder = LabelEncoder()
for column in categorical_columns:
    df[column] = label_encoder.fit_transform(df[column].astype(str))

# Prepare features (X) and target (y)
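# Drop the target column as well so the label does not leak into the features
X = df.drop(['review_content', 'review_title'], axis=1)  # Replace 'review_title' with your actual target column name
y = df['review_title']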

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Extract feature importance from model coefficients
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_[0]
}).sort_values(by='Coefficient', key=abs, ascending=False)

# Print the most important features
print("Feature Importance Based on Coefficients:\n", feature_importance)

# Visualize feature importance
plt.figure(figsize=(10, 6))
plt.barh(feature_importance['Feature'], feature_importance['Coefficient'], color='skyblue')
plt.title('Feature Importance Based on Logistic Regression Coefficients')
plt.xlabel('Coefficient Value')
plt.ylabel('Feature')
plt.gca().invert_yaxis()  # Invert y-axis to show the most important feature on top
plt.grid(True)
plt.show()

### 19. Write a Python program to train Logistic Regression and evaluate its performance using Cohen’s Kappa Score

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score, accuracy_score
from sklearn.preprocessing import LabelEncoder

# Load dataset from CSV file
df = pd.read_csv('amazon.csv')  # Replace with your actual dataset filename

# Convert categorical data into numerical values using Label Encoding
categorical_columns = df.select_dtypes(include=['object']).columns
label_encoder = LabelEncoder()
for column in categorical_columns:
    df[column] = label_encoder.fit_transform(df[column].astype(str))

# Prepare features (X) and target (y)
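# Drop the target column as well so the label does not leak into the features
X = df.drop(['review_content', 'review_title'], axis=1)  # Replace 'review_title' with your actual target column name
y = df['review_title']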

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict on test data
y_pred = model.predict(X_test)

# Evaluate performance using Cohen's Kappa Score
kappa_score = cohen_kappa_score(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)

# Print performance metrics
print("Model Accuracy:", accuracy)
print("Cohen's Kappa Score:", kappa_score)
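
To make the metric concrete, Cohen's kappa can be computed by hand as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The following is a minimal sketch under that definition. The function name `cohen_kappa` and the demo labels are hypothetical and are not taken from the dataset above.

```python
from collections import Counter

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between two labelings, corrected for chance."""
    n = len(y_true)
    # Observed agreement: fraction of samples where the two labels match
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement, from the marginal class frequencies of each labeling
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    # Note: undefined (division by zero) if both labelings use a single class
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels for illustration only
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
kappa_demo = cohen_kappa(y_true_demo, y_pred_demo)
print(f"Cohen's kappa: {kappa_demo:.4f}")
```

For these labels the accuracy is 0.80 while kappa is about 0.58, because part of the agreement would be expected by chance alone.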

### 20. Write a Python program to train Logistic Regression and visualize the Precision-Recall Curve for binary classification

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt

# Load dataset from CSV file
df = pd.read_csv('amazon.csv')  # Replace with your actual dataset filename

# Convert categorical data into numerical values using Label Encoding
categorical_columns = df.select_dtypes(include=['object']).columns
label_encoder = LabelEncoder()
for column in categorical_columns:
    df[column] = label_encoder.fit_transform(df[column].astype(str))

# Prepare features (X) and target (y)
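# Drop the target column as well so the label does not leak into the features
X = df.drop(['review_content', 'review_title'], axis=1)  # Replace 'review_title' with your actual target column name
y = df['review_title']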

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict probabilities for the positive class
y_scores = model.predict_proba(X_test)[:, 1]

# Calculate precision-recall values
precision, recall, thresholds = precision_recall_curve(y_test, y_scores)
average_precision = average_precision_score(y_test, y_scores)

# Plot the Precision-Recall Curve
plt.figure(figsize=(8, 6))
plt.plot(recall, precision, marker='.', label=f'AP = {average_precision:.2f}')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve for Logistic Regression')
plt.legend(loc='lower left')
plt.grid(True)
plt.show()

# Print average precision score
print("Average Precision Score:", average_precision)


### 21. Write a Python program to train Logistic Regression with different solvers (liblinear, saga, lbfgs) and compare their accuracy

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

# Load dataset from CSV file
df = pd.read_csv('amazon.csv')  # Replace with your actual dataset filename

# Convert categorical data into numerical values using Label Encoding
categorical_columns = df.select_dtypes(include=['object']).columns
label_encoder = LabelEncoder()
for column in categorical_columns:
    df[column] = label_encoder.fit_transform(df[column].astype(str))

# Prepare features (X) and target (y)
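# Drop the target column as well so the label does not leak into the features
X = df.drop(['review_content', 'review_title'], axis=1)  # Replace 'review_title' with your actual target column name
y = df['review_title']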

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# List of solvers to test
solvers = ['liblinear', 'saga', 'lbfgs']
accuracy_results = {}

# Train and evaluate model using different solvers
for solver in solvers:
    model = LogisticRegression(solver=solver, max_iter=1000)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracy_results[solver] = accuracy
    print(f"Accuracy with solver '{solver}': {accuracy:.4f}")

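# Compare the solvers and report the one with the highest accuracy
best_solver = max(accuracy_results, key=accuracy_results.get)
print(f"Best solver: '{best_solver}' with accuracy {accuracy_results[best_solver]:.4f}")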


### 22. Write a Python program to train Logistic Regression and evaluate its performance using Matthews Correlation Coefficient (MCC).

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef, accuracy_score
from sklearn.preprocessing import LabelEncoder

# Load dataset from CSV file
df = pd.read_csv('amazon.csv')  # Replace with your actual dataset filename

# Convert categorical data into numerical values using Label Encoding
categorical_columns = df.select_dtypes(include=['object']).columns
label_encoder = LabelEncoder()
for column in categorical_columns:
    df[column] = label_encoder.fit_transform(df[column].astype(str))

# Prepare features (X) and target (y)
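# Drop the target column as well so the label does not leak into the features
X = df.drop(['review_content', 'review_title'], axis=1)  # Replace 'review_title' with your actual target column name
y = df['review_title']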

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict on test data
y_pred = model.predict(X_test)

# Evaluate performance using MCC and Accuracy
mcc_score = matthews_corrcoef(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)

# Print performance metrics
print("Model Accuracy:", accuracy)
print("Matthews Correlation Coefficient (MCC):", mcc_score)
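
For a binary problem, MCC can be written directly in terms of the confusion-matrix counts: (TP·TN - FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below applies that formula to hypothetical counts. The helper name `mcc_from_counts` and the numbers are assumptions for illustration, and returning 0 for a zero denominator is a common convention rather than part of the formula.

```python
import math

def mcc_from_counts(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from binary confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0 when the denominator is 0
    return numerator / denominator if denominator else 0.0

# Hypothetical imbalanced case: the classifier always predicts the negative class
always_negative = mcc_from_counts(tp=0, tn=95, fp=0, fn=5)
# Hypothetical perfect classifier on the same 100 samples
perfect = mcc_from_counts(tp=5, tn=95, fp=0, fn=0)

print("Always-negative accuracy:", 95 / 100, "MCC:", always_negative)
print("Perfect classifier MCC:", perfect)
```

The always-negative classifier reaches 95% accuracy yet has an MCC of 0, which is why MCC is often preferred on imbalanced data.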

### 23. Write a Python program to train Logistic Regression on both raw and standardized data. Compare their accuracy to see the impact of feature scaling.

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score

# Load dataset from CSV file
df = pd.read_csv('amazon.csv')  # Replace with your actual dataset filename

# Convert categorical data into numerical values using Label Encoding
categorical_columns = df.select_dtypes(include=['object']).columns
label_encoder = LabelEncoder()
for column in categorical_columns:
    df[column] = label_encoder.fit_transform(df[column].astype(str))

# Prepare features (X) and target (y)
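# Drop the target column as well so the label does not leak into the features
X = df.drop(['review_content', 'review_title'], axis=1)  # Replace 'review_title' with your actual target column name
y = df['review_title']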

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 1️⃣ Train Logistic Regression on Raw Data
model_raw = LogisticRegression(max_iter=1000)
model_raw.fit(X_train, y_train)
y_pred_raw = model_raw.predict(X_test)
accuracy_raw = accuracy_score(y_test, y_pred_raw)
print("Accuracy on Raw Data:", accuracy_raw)

# 2️⃣ Apply Standardization (Feature Scaling)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Logistic Regression on Standardized Data
model_scaled = LogisticRegression(max_iter=1000)
model_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = model_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)
print("Accuracy on Standardized Data:", accuracy_scaled)

# 3️⃣ Compare Results
if accuracy_scaled > accuracy_raw:
    print("Feature scaling improved the model's accuracy.")
elif accuracy_scaled == accuracy_raw:
    print("Feature scaling had no impact on the model's accuracy.")
else:
    print("Feature scaling decreased the model's accuracy.")


### 24. Write a Python program to train Logistic Regression and find the optimal C (regularization strength) using cross-validation.

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset from CSV file
df = pd.read_csv('amazon.csv')  # Replace with your actual dataset filename

# Convert categorical data into numerical values using Label Encoding
categorical_columns = df.select_dtypes(include=['object']).columns
label_encoder = LabelEncoder()
for column in categorical_columns:
    df[column] = label_encoder.fit_transform(df[column].astype(str))

# Prepare features (X) and target (y)
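# Drop the target column as well so the label does not leak into the features
X = df.drop(['review_content', 'review_title'], axis=1)  # Replace 'review_title' with your actual target column name
y = df['review_title']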

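# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply feature scaling: fit the scaler on the training split only,
# so that test-set statistics do not leak into training
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)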

# Define parameter grid for 'C' values
param_grid = {'C': [0.001, 0.01, 0.1, 0.5, 1, 5, 10, 50, 100]}

# Initialize Logistic Regression model
model = LogisticRegression(max_iter=1000)

# Use GridSearchCV for cross-validation
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Best parameter and corresponding accuracy
best_C = grid_search.best_params_['C']
best_score = grid_search.best_score_

print(f"Optimal Regularization Strength (C): {best_C}")
print(f"Cross-Validation Accuracy with Best C: {best_score:.4f}")

# Evaluate the model on the test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test Set Accuracy with Optimal C: {test_accuracy:.4f}")
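
One way to keep scaling inside each cross-validation fold is to wrap the scaler and the classifier in a pipeline and tune `C` through it. The sketch below shows the idea on synthetic data. `make_classification`, the grid values, and the names `pipe`, `search`, and `best_c` are assumptions for illustration. The parameter key `logisticregression__C` follows from the step name that `make_pipeline` derives from the class name.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary dataset (an assumption for illustration; not the CSV used above)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# The scaler is refit on the training portion of every fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

c_values = [0.01, 0.1, 1, 10]
search = GridSearchCV(pipe, {"logisticregression__C": c_values}, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

best_c = search.best_params_["logisticregression__C"]
print("Best C:", best_c)
print(f"Cross-validation accuracy: {search.best_score_:.4f}")
```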


### 25. Write a Python program to train Logistic Regression, save the trained model using joblib, and load it again to make predictions.

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score
import joblib  # For saving and loading the model

# Load dataset from CSV file
df = pd.read_csv('amazon.csv')  # Replace with your actual dataset filename

# Convert categorical data into numerical values using Label Encoding
categorical_columns = df.select_dtypes(include=['object']).columns
label_encoder = LabelEncoder()
for column in categorical_columns:
    df[column] = label_encoder.fit_transform(df[column].astype(str))

# Prepare features (X) and target (y)
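# Drop the target column as well so the label does not leak into the features
X = df.drop(['review_content', 'review_title'], axis=1)  # Replace 'review_title' with your actual target column name
y = df['review_title']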

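# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply feature scaling: fit the scaler on the training split only,
# so that test-set statistics do not leak into training
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)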

# Train Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Save the trained model using joblib
joblib.dump(model, 'logistic_regression_model.joblib')
print("Model saved successfully!")

# Load the saved model
loaded_model = joblib.load('logistic_regression_model.joblib')
print("Model loaded successfully!")

# Make predictions using the loaded model
y_pred = loaded_model.predict(X_test)

# Evaluate the loaded model's accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of the Loaded Model:", accuracy)

# Predict on new data (example)
# Replace this with actual new input data as needed
new_data = X_test[0].reshape(1, -1)
new_prediction = loaded_model.predict(new_data)
print("Prediction for new data:", new_prediction)
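
A saved classifier on its own expects inputs that have already been scaled, so it can be convenient to persist the scaler and the model together. The sketch below does this with a pipeline on synthetic data. `make_classification`, the temporary directory, and the file name `logreg_pipeline.joblib` are assumptions for illustration, not part of the exercise above.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for the CSV used above (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The pipeline bundles scaling and the classifier into a single object
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_tr, y_tr)

# Save and reload the whole pipeline (hypothetical file name in a temp directory)
model_path = os.path.join(tempfile.mkdtemp(), "logreg_pipeline.joblib")
joblib.dump(pipeline, model_path)
loaded_pipeline = joblib.load(model_path)

# The loaded pipeline scales raw inputs itself before predicting
same_predictions = bool((loaded_pipeline.predict(X_te) == pipeline.predict(X_te)).all())
print("Loaded pipeline reproduces predictions:", same_predictions)
```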
