In [784]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display, Math, Latex
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import classification_report as report
from nltk.sentiment import SentimentIntensityAnalyzer
from scipy.stats import skew
from sklearn.preprocessing import StandardScaler

pd.set_option('display.max_columns', None)

## Logistic Regression:

### 1. Logistic regression is a classification algorithm used in machine learning to predict categorical outcomes, typically binary (e.g., yes/no, pass/fail, 0/1). Instead of predicting a continuous value like linear regression, logistic regression estimates the probability that a given input belongs to a particular class.
* Function: It applies the sigmoid function (also called the logistic function), which maps real-valued numbers to a probability range between 0 and 1.
* Output Interpretation: Since the output is a probability, a threshold (often 0.5) is set to determine the final classification.
* Use Cases: Spam detection, fraud detection, medical diagnoses (e.g., predicting whether a patient has a disease).

* Key Points to Remember:
    * predicts discrete categorical label
    * based on maximum likelihood estimation
    * dependent variable: categorical
    * independent variable: continuous numeric or categorical
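
The sigmoid and threshold steps described above can be sketched in a few lines. This is a minimal illustration using only the standard library; the `scores` values are hypothetical linear scores (the model's $\beta_0 + \beta_1 x$), not output from any model in this notebook.

```python
import math

def sigmoid(z):
    """Map any real number z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(z, threshold=0.5):
    """Return 1 if the predicted probability reaches the threshold, else 0."""
    return 1 if sigmoid(z) >= threshold else 0

# Hypothetical linear scores for three observations
scores = [-2.0, 0.0, 3.0]
probabilities = [sigmoid(z) for z in scores]
labels = [classify(z) for z in scores]

print([round(p, 3) for p in probabilities])  # [0.119, 0.5, 0.953]
print(labels)                                # [0, 1, 1]
```

Changing `threshold` moves the cut-off between the two classes without changing the probabilities themselves.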

 ####  A. Binary Classification
* Definition: A model that classifies data into two possible classes (e.g., "positive" or "negative", "spam" or "not spam").
    * Two Values: Yes/No, Pass/Fail, Male/Female
* Output Layer: Typically consists of a single neuron with an activation function like sigmoid to predict probabilities between 0 and 1 (closer to 0 → one class; closer to 1 → another class).
* Examples:
    * Detecting fraudulent transactions (fraud vs. no fraud).
    * Classifying emails as spam or not spam.
    * Predicting whether a stock price will go up or down.
		
#### B. Multi-Class Classification
* Definition: A model that classifies data into three or more possible categories (e.g., classifying types of flowers or identifying object types in images).
    * More than two values: Days of the week, Types of Credit Ratings, Number of Products Purchased
*  Output Layer: Contains multiple neurons (one per class) and typically uses a softmax activation function to assign probabilities across the available classes.
* Examples:
    * Identifying different breeds of dogs in an image.
    * Categorizing customer feedback into sentiment classes ("positive," "neutral," "negative").
    * Predicting the type of transaction (loan, investment, purchase).
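
As a rough sketch of how softmax spreads probability across several classes, the snippet below uses only the standard library. The class names come from the sentiment example above; the `scores` are hypothetical values chosen for illustration.

```python
import math

def softmax(scores):
    """Convert real-valued class scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

classes = ["positive", "neutral", "negative"]
scores = [2.0, 1.0, 0.1]  # hypothetical scores, one per class

probs = softmax(scores)
predicted = classes[probs.index(max(probs))]  # class with the highest probability

print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
print(predicted)                     # positive
```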

	
##### In classification models, when dealing with discrete categories, a binary classifier predicts two outcomes, while a multi-class classifier can handle three or more distinct categories.

### 3. Creating Dummy Variables

* Dummy variables are binary (0 or 1) representations of categorical data. Instead of using text labels (e.g., "ART_AND_DESIGN", "AUTO_AND_VEHICLES"), we create separate columns where:
    - 1 indicates the presence of a category
    - 0 indicates absence

* For a predictor with more than two categories, we need one dummy variable per category except the reference, so k categories become k - 1 dummy variables. For example, take a predictor with 4 categories:
    * ART_AND_DESIGN
    * AUTO_AND_VEHICLES
    * BEAUTY
    * BOOKS_AND_REFERENCE

$$ E(y \mid \text{BOOKS\_AND\_REFERENCE}) = \beta_0 \quad \text{when } x_1 = x_2 = x_3 = 0 $$

$$ E(y \mid \text{ART\_AND\_DESIGN}) = \beta_0 + \beta_1 \quad \text{when } x_1 = 1,\ x_2 = x_3 = 0 $$

$$ E(y \mid \text{AUTO\_AND\_VEHICLES}) = \beta_0 + \beta_2 \quad \text{when } x_2 = 1,\ x_1 = x_3 = 0 $$

$$ E(y \mid \text{BEAUTY}) = \beta_0 + \beta_3 \quad \text{when } x_3 = 1,\ x_1 = x_2 = 0 $$




| Category              | X1 | X2 | X3 |
|----------------------|----|----|----|
| ART_AND_DESIGN      | 1  | 0  | 0  |
| AUTO_AND_VEHICLES   | 0  | 1  | 0  |
| BEAUTY             | 0  | 0  | 1  |
| BOOKS_AND_REFERENCE | 0  | 0  | 0  |

#### Steps to creating dummy variables

1. Find variables that are non-numeric (e.g., "Category", "Region", "Industry").
2. Convert categorical variables into binary columns using `pd.get_dummies()`:
    * `import pandas as pd`
    * `df_dummies = pd.get_dummies(df, columns=['Color'], drop_first=True, prefix=['Color'])  # drops one category as reference`
    * Key parameters:
        * `columns`: specifies which categorical columns to encode.
        * `drop_first=True`: drops the first category to avoid multicollinearity (useful in regression models).
        * `prefix`: allows custom prefixes for column names.
3. Choose a reference category to drop to avoid multicollinearity.
    * The omitted category becomes the baseline for comparison.
4. Include the dummy variables in the regression model.
5. Interpret the coefficients.
    * Each dummy variable's coefficient shows how much that category differs from the reference category.
    * Example: If "BOOKS_AND_REFERENCE" is the reference, $\beta_1$ tells how "ART_AND_DESIGN" affects $y$ compared to "BOOKS_AND_REFERENCE".
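
The steps above can be sketched with the four categories from the table. One detail worth noting: `drop_first=True` drops whichever category comes first, which alphabetically would be "ART_AND_DESIGN". To make "BOOKS_AND_REFERENCE" the reference, as in the table, this sketch lists it first in a `pd.Categorical`. The `apps` DataFrame is a hypothetical four-row example, not one of the datasets used later.

```python
import pandas as pd

# Hypothetical data: one row per category
apps = pd.DataFrame({
    "Category": ["BEAUTY", "BOOKS_AND_REFERENCE", "ART_AND_DESIGN", "AUTO_AND_VEHICLES"]
})

# Put the intended reference category first so drop_first removes it
ordered = ["BOOKS_AND_REFERENCE", "ART_AND_DESIGN", "AUTO_AND_VEHICLES", "BEAUTY"]
apps["Category"] = pd.Categorical(apps["Category"], categories=ordered)

# One 0/1 column per remaining category; the reference row is all zeros
dummies = pd.get_dummies(apps, columns=["Category"], drop_first=True, dtype=int)
print(dummies)
```

The row for "BOOKS_AND_REFERENCE" contains only zeros, matching the last row of the table.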

### Model Evaluation

How to Interpret Each Metric: Class-Specific Metrics (per label)

* Each row corresponds to a class (e.g., 0 and 1 for binary classification).
    * Precision = Correct positive predictions / All predicted positives
        → High precision means few false positives.
    * Recall = Correct positive predictions / All actual positives
        → High recall means fewer false negatives.
    * F1-score = Harmonic mean of precision & recall
        → Good balance between avoiding false positives and false negatives.
    * Support = Number of actual occurrences of the class in the test set
        → Helps understand the distribution of samples.

* Overall Model Metrics
    * Accuracy = (Correct predictions) / (Total samples)
        → Measures overall performance.
    * Macro Avg = Unweighted mean of each metric across all classes
        → Treats every class equally, so weak performance on a minority class is not hidden.
    * Weighted Avg = Mean of precision, recall, and F1 across classes, weighted by each class's support
        → Reflects the overall sample distribution, but can be dominated by the majority class.

* Key Takeaways
  * If precision is low, the model is producing too many false positives.
  * If recall is low, the model may be missing real positive cases.
  * If F1-score varies significantly across classes, the model may be struggling with imbalance.
  * If accuracy is high but recall for a class is low, the model may be biased toward majority classes.
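
To make the definitions above concrete, here is a small sketch that computes the same quantities by hand. The label lists are hypothetical, and `class_metrics` is an illustrative helper rather than a scikit-learn function; `classification_report` produces these numbers for you.

```python
def class_metrics(y_true, y_pred, label):
    """Precision, recall, F1 and support for one class, treating `label` as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}

# Hypothetical true labels and predictions (4 positives, 6 negatives)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

per_class = {label: class_metrics(y_true, y_pred, label) for label in (0, 1)}
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Macro average ignores class size; weighted average scales by support
macro_recall = sum(m["recall"] for m in per_class.values()) / len(per_class)
weighted_recall = sum(m["recall"] * m["support"] for m in per_class.values()) / len(y_true)

for label, m in per_class.items():
    print(label, {k: round(v, 2) for k, v in m.items()})
print("accuracy:", accuracy)
print("macro recall:", round(macro_recall, 2), "weighted recall:", round(weighted_recall, 2))
```

In this example the weighted recall equals the accuracy, while the macro recall is lower because the smaller class is recalled less often.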


In [216]:
#Logistic regression: binary classification using the Titanic dataset

#create dataframe
titanic = pd.read_csv("titanictrain.csv")
titanictest = pd.read_csv("titanictest.csv")
titanicytest=pd.read_csv("gender_submission.csv")

#inspect data
titanic.info() 
titanictest.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass  

In [218]:
#clean missing data in Age,Embarked, Cabin

titanic["Age"]=titanic["Age"].fillna(titanic["Age"].median())  # Fill with median age
titanic["Embarked"]=titanic["Embarked"].fillna(titanic["Embarked"].mode()[0])  # Fill with most common value
titanic["Cabin"]=titanic["Cabin"].fillna("Unknown")  # Replace missing Cabin data
titanictest["Age"]=titanictest["Age"].fillna(titanictest["Age"].median())  # Fill with median age
titanictest["Embarked"]=titanictest["Embarked"].fillna(titanictest["Embarked"].mode()[0])  # Fill with most common value
titanictest["Cabin"]=titanictest["Cabin"].fillna("Unknown")  # Replace missing Cabin data

display(titanic.head(),titanictest.head())

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,Unknown,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,Unknown,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,Unknown,S


Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,Unknown,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,Unknown,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,Unknown,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,Unknown,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,Unknown,S


In [220]:
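#create a dummy variable for gender
#drop_first=True drops 'female' (the reference), so gender is True for male passengers

titanic['gender']=pd.get_dummies(titanic['Sex'],drop_first=True)['male']
titanictest['gender']=pd.get_dummies(titanictest['Sex'],drop_first=True)['male']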



In [222]:
display(titanic.head())

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,gender
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,Unknown,S,True
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,False
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,Unknown,S,False
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,False
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,Unknown,S,True


In [226]:
titanic[['Survived','gender','Pclass','Fare','SibSp','Parch']].corr()

Unnamed: 0,Survived,gender,Pclass,Fare,SibSp,Parch
Survived,1.0,-0.543351,-0.338481,0.257307,-0.035322,0.081629
gender,-0.543351,1.0,0.1319,-0.182333,-0.114631,-0.245489
Pclass,-0.338481,0.1319,1.0,-0.5495,0.083081,0.018443
Fare,0.257307,-0.182333,-0.5495,1.0,0.159651,0.216225
SibSp,-0.035322,-0.114631,0.083081,0.159651,1.0,0.414838
Parch,0.081629,-0.245489,0.018443,0.216225,0.414838,1.0


In [166]:
#create dependent and independent variables

y = titanic['Survived']
ytest=titanicytest['Survived']

x = titanic[['Pclass','gender']]
xtest=titanictest[['Pclass','gender']]

In [168]:
#split data
x_train,x_test,y_train,y_test= train_test_split(x,y,test_size = 0.2, random_state = 56)

x_train.head(5)

Unnamed: 0,gender
226,True
278,True
31,False
449,True
632,True


In [170]:
#train the model

survivalmodel = LogisticRegression()

survivalmodel.fit(x_train,y_train)

In [172]:
#make predictions

ypredict = survivalmodel.predict(x_test)
accuracy = survivalmodel.score(x_test,y_test)

print(f'Predictions:{ypredict} \nAccuracy:{round(accuracy,3)}')

Predictions:[0 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 1 1 0 1 1 0 0 1 0 0 1 1 1 1 1 0 0 0 0 0 1
 0 1 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 1 0 1
 1 1 0 0 0 1 1 1 0 1 1 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0
 1 1 0 0 0 0 1 0 0 0 0 1 0 1 0 1 0 0 0 0 1 0 1 1 1 0 1 0 0 0 1 0 0 0 1 0 1
 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1] 
Accuracy:0.872


In [174]:
#print classification report to evaluate the model

titanic_results = report(y_test,ypredict)

print(f'Classification Report:\n {titanic_results}')

Classification Report:
               precision    recall  f1-score   support

           0       0.88      0.93      0.90       116
           1       0.86      0.76      0.81        63

    accuracy                           0.87       179
   macro avg       0.87      0.85      0.86       179
weighted avg       0.87      0.87      0.87       179



In [176]:
#retrain the model on all training rows (no hold-out split)

survivalmodel1 = LogisticRegression()

survivalmodel1.fit(x,y)

In [232]:
y_predict2=survivalmodel1.predict(xtest)
accuracy2=survivalmodel1.score(xtest,ytest)

print(f'Predictions:{y_predict2} \nAccuracy:{round(accuracy2,3)}')

Predictions:[0 1 0 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 1 1 0 0 1 0 1 0 1 0 0 0 0 0 1 1 0 0 1
 1 0 0 0 0 0 1 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 1 1 0 1 0
 1 0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0
 1 1 1 1 0 0 1 0 1 1 0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0
 0 0 1 0 0 1 0 0 1 1 0 1 1 0 1 0 0 1 0 0 1 1 0 0 0 0 0 1 1 0 1 1 0 0 1 0 1
 0 1 0 1 0 0 0 0 0 0 0 0 1 0 1 1 0 0 1 0 0 1 0 1 0 0 0 0 1 1 0 1 0 1 0 1 0
 1 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 1
 0 0 0 1 1 0 0 0 0 1 0 0 0 1 1 0 1 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 1 0 0 0 0
 1 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0
 1 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0 1 1 0 1 1 0
 0 1 0 0 1 1 1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0
 0 1 1 1 1 1 0 1 0 0 0] 
Accuracy:1.0


In [228]:
titanic_results2 = report(ytest,y_predict2)

print(f'Classification Report:\n {titanic_results2}')

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       266
           1       1.00      1.00      1.00       152

    accuracy                           1.00       418
   macro avg       1.00      1.00      1.00       418
weighted avg       1.00      1.00      1.00       418
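
A perfect score like this is worth a sanity check. On Kaggle, `gender_submission.csv` is distributed as a sample submission that assumes all and only female passengers survived. It is not the true outcome for the test passengers, so a model that predicts from `gender` can reproduce it exactly. The sketch below shows one way to test for that. The two small DataFrames are stand-ins for `titanictest` and `titanicytest`: the `PassengerId` and `Sex` values are the five test rows shown earlier, and the `Survived` values are assumed to follow the gender rule.

```python
import pandas as pd

# Stand-ins for titanictest and titanicytest (Survived values are assumed)
features = pd.DataFrame({
    "PassengerId": [892, 893, 894, 895, 896],
    "Sex": ["male", "female", "male", "male", "female"],
})
labels = pd.DataFrame({
    "PassengerId": [892, 893, 894, 895, 896],
    "Survived": [0, 1, 0, 0, 1],
})

# Align rows by PassengerId, then compare the labels with a pure gender rule
merged = features.merge(labels, on="PassengerId")
gender_rule = (merged["Sex"] == "female").astype(int)
agreement = (gender_rule == merged["Survived"]).mean()

print(f"Share of labels explained by gender alone: {agreement:.2f}")
```

If the same check returns 1.00 on the real files, the 100% accuracy above measures agreement with a gender-only baseline, and the 0.872 hold-out accuracy from the earlier split is the more meaningful estimate.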



In [600]:
#disease dataset: 'Disease' has many categories, but the target ('Outcome Variable') is binary (Positive/Negative)

disease = pd.read_csv("Disease_symptom_and_patient_profile_dataset.csv")


disease.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Disease               344 non-null    object
 1   Fever                 344 non-null    object
 2   Cough                 344 non-null    object
 3   Fatigue               344 non-null    object
 4   Difficulty Breathing  344 non-null    object
 5   Age                   344 non-null    int64 
 6   Gender                344 non-null    object
 7   Blood Pressure        344 non-null    object
 8   Cholesterol Level     344 non-null    object
 9   Outcome Variable      344 non-null    object
dtypes: int64(1), object(9)
memory usage: 27.0+ KB


In [604]:
disease.head(10)

Unnamed: 0,Disease,Fever,Cough,Fatigue,Difficulty Breathing,Age,Gender,Blood Pressure,Cholesterol Level,Outcome Variable
0,Allergic Rhinitis,No,Yes,Yes,No,29,Female,Normal,Low,Negative
1,Allergic Rhinitis,No,No,Yes,No,35,Female,Normal,Low,Negative
2,Allergic Rhinitis,No,Yes,No,No,38,Female,Low,Normal,Negative
3,Allergic Rhinitis,No,Yes,Yes,No,45,Male,High,Normal,Negative
4,Allergic Rhinitis,Yes,Yes,Yes,No,45,Male,High,Normal,Positive
5,Allergic Rhinitis,Yes,No,No,No,50,Male,High,High,Negative
6,Alzheimer's Disease,No,Yes,No,No,65,Male,Normal,High,Positive
7,Alzheimer's Disease,No,Yes,No,No,65,Male,Normal,High,Positive
8,Alzheimer's Disease,No,No,Yes,No,65,Female,High,High,Positive
9,Alzheimer's Disease,Yes,No,Yes,No,70,Female,High,Normal,Negative


In [606]:
disease['Outcome Variable'].value_counts()

Outcome Variable
Positive    185
Negative    159
Name: count, dtype: int64

In [608]:
#convert the unique disease names to a list so the values can be mapped to integers
Diseases=disease['Disease'].unique()
Diseases=list(Diseases)

In [610]:
#create a disease map 
diseasemap = {value:index for index,value in enumerate(Diseases,start=1)}

#create a new column in the dataframe with the mapped values
disease['diseasemap']=disease.Disease.map(diseasemap)

print(diseasemap)

{'Allergic Rhinitis': 1, "Alzheimer's Disease": 2, 'Anxiety Disorders': 3, 'Appendicitis': 4, 'Asthma': 5, 'Atherosclerosis': 6, 'Bipolar Disorder': 7, 'Cancer': 8, 'Cataracts': 9, 'Cerebral Palsy': 10, 'Chickenpox': 11, 'Cholecystitis': 12, 'Cholera': 13, 'Chronic Kidney Disease': 14, 'Chronic Obstructive Pulmonary Disease (COPD)': 15, 'Conjunctivitis (Pink Eye)': 16, 'Coronary Artery Disease': 17, 'Irritable Bowel Disease': 18, 'Cystic Fibrosis': 19, 'Dementia': 20, 'Mosquito Born Disease (Malaria,Zika Virus, Dengue)': 21, 'Depression': 22, 'Diabetes': 23, 'Diverticulitis': 24, 'Eating Disorders (Anorexia,...': 25, 'Ebola Virus': 26, 'Eczema': 27, 'Endometriosis': 28, 'Epilepsy': 29, 'Fibromyalgia': 30, 'Gastroenteritis': 31, 'Glaucoma': 32, 'Gout': 33, 'Hemophilia': 34, 'Hemorrhoids': 35, 'Hepatitis/Hepatits B': 36, 'HIV/AIDS': 37, 'Hyperglycemia': 38, 'Hypertension': 39, 'Hyperthyroidism': 40, 'Hypoglycemia': 41, 'Hypothyroidism': 42, 'Kidney Disease': 43, 'Klinefelter Syndrome': 4

In [612]:
disease['genderd']=pd.get_dummies(disease.Gender,drop_first=True)
disease['outcome']=pd.get_dummies(disease['Outcome Variable'],drop_first=True)
disease[['fever','cough','fatigue','breathing']]=pd.get_dummies(disease[['Fever','Cough','Fatigue','Difficulty Breathing']],drop_first=True)

In [614]:
dictionary_map={'High':0,'Normal':1,'Low':2}

disease['BP']=disease['Blood Pressure'].map(dictionary_map)
disease['cholesterol']=disease['Cholesterol Level'].map(dictionary_map)

In [618]:
#create new dataframe of the transformed categories
diseasecleaned=disease[['Disease','diseasemap','fever','cough','fatigue','breathing','Age',
                         'genderd','BP','cholesterol','outcome']].copy()

diseasecleaned=diseasecleaned.sort_values('diseasemap').reset_index()


In [648]:
diseasecleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   index        344 non-null    int64 
 1   Disease      344 non-null    object
 2   diseasemap   344 non-null    int64 
 3   fever        344 non-null    bool  
 4   cough        344 non-null    bool  
 5   fatigue      344 non-null    bool  
 6   breathing    344 non-null    bool  
 7   Age          344 non-null    int64 
 8   genderd      344 non-null    bool  
 9   BP           344 non-null    int64 
 10  cholesterol  344 non-null    int64 
 11  outcome      344 non-null    bool  
dtypes: bool(6), int64(5), object(1)
memory usage: 18.3+ KB


In [677]:
diseasecleaned[['outcome','fever','Age','breathing','genderd','fatigue','diseasemap']].corr()

Unnamed: 0,outcome,fever,Age,breathing,genderd,fatigue,diseasemap
outcome,1.0,0.173623,0.054384,0.089192,-0.126971,0.145395,0.010691
fever,0.173623,1.0,-0.009466,0.256314,0.023567,-0.069926,0.059734
Age,0.054384,-0.009466,1.0,-0.206971,0.041316,0.103206,0.111114
breathing,0.089192,0.256314,-0.206971,1.0,0.09019,0.077317,-0.061138
genderd,-0.126971,0.023567,0.041316,0.09019,1.0,-0.005441,0.003781
fatigue,0.145395,-0.069926,0.103206,0.077317,-0.005441,1.0,0.039265
diseasemap,0.010691,0.059734,0.111114,-0.061138,0.003781,0.039265,1.0


In [665]:
#assign x and y variables (diseasemap is an arbitrary integer code, not an ordered quantity)
xdisease = diseasecleaned[['diseasemap','fever','Age','breathing','genderd','fatigue','cough']]
yd=diseasecleaned['outcome']

In [667]:
#split data
xd_train,xd_test,yd_train,yd_test= train_test_split(xdisease,yd,test_size = 0.2, random_state = 56)

xd_train.head(5)

Unnamed: 0,diseasemap,fever,Age,breathing,genderd,fatigue,cough
189,45,False,55,False,True,True,True
255,64,True,55,False,False,False,True
90,18,True,35,False,True,False,False
157,39,False,50,False,False,False,True
40,5,True,25,True,True,False,True


In [669]:
#train the model
diseasemodel=LogisticRegression()
diseasemodel.fit(xd_train,yd_train)

In [671]:
#get predictions and score
diseasepredict = diseasemodel.predict(xd_test)
score=diseasemodel.score(xd_test,yd_test)

print(f'Predictions:{diseasepredict} \nAccuracy:{round(score,3)}')

Predictions:[False  True  True  True  True  True False  True False False  True False
 False  True  True False False False  True False  True  True False  True
 False  True  True False  True  True False False  True  True  True  True
  True  True  True False False  True  True  True  True  True False  True
  True False False False  True  True  True  True  True  True  True  True
 False False False  True  True  True  True False False] 
Accuracy:0.565


In [673]:
#evaluate results
diseasereport = report(yd_test,diseasepredict)

print(f'Classification Report: {diseasereport}')

Classification Report:               precision    recall  f1-score   support

       False       0.65      0.45      0.53        38
        True       0.51      0.71      0.59        31

    accuracy                           0.57        69
   macro avg       0.58      0.58      0.56        69
weighted avg       0.59      0.57      0.56        69



In [751]:
#logistic regression 

bots = pd.read_csv("bots_vs_users.csv")

bots.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5874 entries, 0 to 5873
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   has_domain                5874 non-null   object
 1   has_birth_date            5874 non-null   object
 2   has_photo                 5874 non-null   object
 3   can_post_on_wall          5874 non-null   object
 4   can_send_message          5874 non-null   object
 5   has_website               5874 non-null   object
 6   gender                    5874 non-null   object
 7   has_short_name            5874 non-null   object
 8   has_first_name            5874 non-null   object
 9   has_last_name             5874 non-null   object
 10  access_to_closed_profile  5874 non-null   object
 11  is_profile_closed         5874 non-null   object
 12  target                    5874 non-null   int64 
dtypes: int64(1), object(12)
memory usage: 596.7+ KB


In [755]:
fixcols=['has_domain', 'has_birth_date', 'has_photo', 'can_post_on_wall',
       'can_send_message', 'has_website', 'gender', 'has_short_name',
       'has_first_name', 'has_last_name', 'access_to_closed_profile',
       'is_profile_closed']

bots[fixcols] = bots[fixcols].apply(lambda col: col.str.replace('Unknown', '0')).astype(int)

In [759]:
bots.corr()

Unnamed: 0,has_domain,has_birth_date,has_photo,can_post_on_wall,can_send_message,has_website,gender,has_short_name,has_first_name,has_last_name,access_to_closed_profile,is_profile_closed,target
has_domain,1.0,0.15508,0.070843,0.012842,0.057099,0.023483,0.179837,1.0,1.0,0.893968,0.163323,0.024673,0.048038
has_birth_date,0.15508,1.0,-0.192052,-0.019826,-0.124302,-0.04266,-0.00094,0.15508,0.15508,0.139629,0.180134,-0.153039,0.240285
has_photo,0.070843,-0.192052,1.0,0.174172,0.770161,0.303947,0.097623,0.070843,0.070843,0.079246,-0.267991,0.285027,-0.794621
can_post_on_wall,0.012842,-0.019826,0.174172,1.0,0.200015,0.090468,0.011643,0.012842,0.012842,0.014365,0.078629,-0.077233,-0.156333
can_send_message,0.057099,-0.124302,0.770161,0.200015,1.0,0.28189,0.115448,0.057099,0.057099,0.063871,-0.103142,0.115372,-0.602869
has_website,0.023483,-0.04266,0.303947,0.090468,0.28189,1.0,0.079132,0.023483,0.023483,0.018878,0.140682,-0.138088,-0.34977
gender,0.179837,-0.00094,0.097623,0.011643,0.115448,0.079132,1.0,0.179837,0.179837,0.162464,0.054062,-0.020581,0.017341
has_short_name,1.0,0.15508,0.070843,0.012842,0.057099,0.023483,0.179837,1.0,1.0,0.893968,0.163323,0.024673,0.048038
has_first_name,1.0,0.15508,0.070843,0.012842,0.057099,0.023483,0.179837,1.0,1.0,0.893968,0.163323,0.024673,0.048038
has_last_name,0.893968,0.139629,0.079246,0.014365,0.063871,0.018878,0.162464,0.893968,0.893968,1.0,0.161615,0.00624,0.057319


In [802]:
#assign y and x variables
xcols=['has_domain', 'has_birth_date', 'has_photo', 'can_post_on_wall',
       'can_send_message', 'has_website', 'gender', 'access_to_closed_profile',
       'is_profile_closed']

yb= bots['target']
xb=bots[xcols]


In [804]:
#split test train data

xb_train,xb_test,yb_train,yb_test = train_test_split(xb,yb,test_size = 0.2,random_state=40)

In [806]:
scaler = StandardScaler()
xb_train = scaler.fit_transform(xb_train)
xb_test = scaler.transform(xb_test)

In [808]:
#train model

botsmodel = LogisticRegression()
botsmodel.fit(xb_train,yb_train)

In [810]:
#predictions
botpredict = botsmodel.predict(xb_test)
botaccuracy = botsmodel.score(xb_test,yb_test)

print(f'Bot Prediction: {botpredict}\n\
Accuracy: {botaccuracy}')

Bot Prediction: [1 0 1 ... 1 0 0]
Accuracy: 0.9123404255319149


In [814]:
#classification report
botreport = report(yb_test,botpredict)

print(f'Classification Report:\n {botreport}')

Classification Report:
               precision    recall  f1-score   support

           0       0.87      0.97      0.92       590
           1       0.96      0.86      0.91       585

    accuracy                           0.91      1175
   macro avg       0.92      0.91      0.91      1175
weighted avg       0.92      0.91      0.91      1175



### Analysis of Results

#### Class-Level Performance
- Class 0 (Users) → Precision: 87%, Recall: 97%, F1-score: 92%
    - High recall (97%) → Model correctly identifies most users.
    - Lower precision (87%) → Some bots are misclassified as users.
    - F1-score (92%) → Balances precision & recall well.
- Class 1 (Bots) → Precision: 96%, Recall: 86%, F1-score: 91%
    - High precision (96%) → Most predicted bots are actually bots.
    - Lower recall (86%) → Some bots are misclassified as users.
    - F1-score (91%) → Strong overall performance

#### Overall Model Performance
- Accuracy (91%) → The model correctly predicts 91% of all cases.
- Macro Avg (92% Precision, 91% Recall, 91% F1-score) → Simple average across classes.
- Weighted Avg (92% Precision, 91% Recall, 91% F1-score) → Average weighted by class support; nearly identical to the macro average here because the classes are almost balanced (590 vs. 585).

The classification model achieved 91% accuracy in distinguishing bots from users. It demonstrated high precision for bots (96%) but slightly lower recall (86%), meaning some bots were misclassified as users. The user metrics reflect the same errors: users had strong recall (97%) but lower precision (87%), because some of the accounts predicted as users were actually bots. Overall, the model performs well but could benefit from improvements in recall for bots.
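
One common way to raise recall for the positive class is to lower the decision threshold below 0.5, at the cost of some precision. The sketch below illustrates the trade-off with hypothetical labels and probabilities. With the fitted model, the probabilities would come from `botsmodel.predict_proba(xb_test)[:, 1]`; `precision_recall_at` is an illustrative helper, not a library function.

```python
def precision_recall_at(y_true, probs, threshold):
    """Precision and recall for class 1 when predicting 1 if prob >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels (1 = bot) and predicted bot probabilities
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
bot_probs = [0.9, 0.7, 0.45, 0.35, 0.6, 0.3, 0.2, 0.1]

for threshold in (0.5, 0.3):
    precision, recall = precision_recall_at(y_true, bot_probs, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

In this toy data, moving the threshold from 0.5 to 0.3 catches every bot, while one more user is flagged as a bot. Whether that trade is acceptable depends on the cost of each kind of error.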
