# Introduction
### This study compares logistic regression and decision tree classifiers on the `drug200.csv` dataset, which has 200 rows with the features `Age`, `Sex`, `BP`, `Cholesterol`, and `Na_to_K` and the target `Drug`. Both models are trained on the same split and evaluated by their accuracy on a held-out test set. The objective is to determine which algorithm better suits this dataset, highlighting the strengths and weaknesses of each approach in a classification context.

### Importing the libraries and loading the data

In [13]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import train_test_split

# Load the dataset from a local CSV file
df = pd.read_csv(r"E:\Datascience\Data set\drug200.csv")


### Printing the first 20 rows

In [14]:
print(df.head(20))

    Age Sex      BP Cholesterol  Na_to_K   Drug
0    23   F    HIGH        HIGH   25.355  DrugY
1    47   M     LOW        HIGH   13.093  drugC
2    47   M     LOW        HIGH   10.114  drugC
3    28   F  NORMAL        HIGH    7.798  drugX
4    61   F     LOW        HIGH   18.043  DrugY
5    22   F  NORMAL        HIGH    8.607  drugX
6    49   F  NORMAL        HIGH   16.275  DrugY
7    41   M     LOW        HIGH   11.037  drugC
8    60   M  NORMAL        HIGH   15.171  DrugY
9    43   M     LOW      NORMAL   19.368  DrugY
10   47   F     LOW        HIGH   11.767  drugC
11   34   F    HIGH      NORMAL   19.199  DrugY
12   43   M     LOW        HIGH   15.376  DrugY
13   74   F     LOW        HIGH   20.942  DrugY
14   50   F  NORMAL        HIGH   12.703  drugX
15   16   F    HIGH      NORMAL   15.516  DrugY
16   69   M     LOW      NORMAL   11.455  drugX
17   43   M    HIGH        HIGH   13.972  drugA
18   23   M     LOW        HIGH    7.298  drugC
19   32   F    HIGH      NORMAL   25.974

### Data Cleaning

In [15]:
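# Reload the raw data and replace spaces in column names with underscores
df = pd.read_csv(r"E:\Datascience\Data set\drug200.csv")
df.columns = [col.replace(" ", "_") for col in df.columns]

# Drop rows with missing values, then show only the non-numeric columns
df.dropna(inplace=True)
numerical_columns = df.select_dtypes(include=['number']).columns
data_without_numerical = df.drop(columns=numerical_columns)
print(data_without_numerical)

# 'categorical_column' is a placeholder name that matches no column in this
# dataset, so this branch does not run here
categorical_column = 'categorical_column'
if categorical_column in df.columns:
    data_without_numerical[categorical_column] = data_without_numerical[categorical_column].fillna('Unknown')

# Remove duplicate rows and summarise what is left
df.drop_duplicates(inplace=True)
print(df.shape)
df.isna().sum()  # not displayed: a cell only echoes its last expression
df.info()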

    Sex      BP Cholesterol   Drug
0     F    HIGH        HIGH  DrugY
1     M     LOW        HIGH  drugC
2     M     LOW        HIGH  drugC
3     F  NORMAL        HIGH  drugX
4     F     LOW        HIGH  DrugY
..   ..     ...         ...    ...
195   F     LOW        HIGH  drugC
196   M     LOW        HIGH  drugC
197   M  NORMAL        HIGH  drugX
198   M  NORMAL      NORMAL  drugX
199   F     LOW      NORMAL  drugX

[200 rows x 4 columns]
(200, 6)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 200 entries, 0 to 199
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Age          200 non-null    int64  
 1   Sex          200 non-null    object 
 2   BP           200 non-null    object 
 3   Cholesterol  200 non-null    object 
 4   Na_to_K      200 non-null    float64
 5   Drug         200 non-null    object 
dtypes: float64(1), int64(1), object(4)
memory usage: 10.9+ KB


### Checking for Duplicates

In [16]:
print(df.duplicated().sum())

0


### One-hot Encoding

In [17]:
df=pd.get_dummies(df,columns=["Sex","BP","Cholesterol"],drop_first=True)
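
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up three-row frame that only borrows the notebook's column names and category levels. The rows are an assumption for illustration, not data from `drug200.csv`. The dtype of the dummy columns depends on the pandas version (boolean in pandas 2.x, `uint8` in earlier releases), so the sketch only inspects the column names.

```python
import pandas as pd

# Toy frame with the same categorical columns and levels as the notebook's
# data; the rows themselves are made up for illustration.
toy = pd.DataFrame({
    "Age": [23, 47, 28],
    "Sex": ["F", "M", "F"],
    "BP": ["HIGH", "LOW", "NORMAL"],
    "Cholesterol": ["HIGH", "HIGH", "NORMAL"],
})

# drop_first=True drops the first level (in sorted order) of each encoded
# column: Sex_F, BP_HIGH and Cholesterol_HIGH. A row at that level is
# represented by all of the column's remaining dummies being zero/False.
encoded = pd.get_dummies(toy, columns=["Sex", "BP", "Cholesterol"], drop_first=True)

print(list(encoded.columns))
# ['Age', 'Sex_M', 'BP_LOW', 'BP_NORMAL', 'Cholesterol_NORMAL']
```

Untouched columns keep their position and the dummy columns are appended in the order given in `columns`, which is what fixes the feature order used later for prediction.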

### Logistic Regression Model Building

In [18]:
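import warnings
# Note: this filter also hides scikit-learn's ConvergenceWarning,
# which is a subclass of UserWarning
warnings.filterwarnings("ignore", category=UserWarning)

# Separate the target from the features
y = df['Drug']
x = df.drop('Drug', axis=1)

# Hold out 20% of the rows for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=33)
lr = LogisticRegression()
model = lr.fit(x_train, y_train)
model.score(x_test, y_test)  # mean accuracy on the test split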

0.95

### Prediction

In [19]:
# Feature order: Age, Na_to_K, Sex_M, BP_LOW, BP_NORMAL, Cholesterol_NORMAL
model.predict([[18,30,True,False,False,True]])

array(['DrugY'], dtype=object)
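
The positional list passed to `predict` is easy to get wrong, because its order has to match the encoded training columns. The sketch below shows one way to build the same input with named columns. `make_patient_row` and `FEATURE_COLUMNS` are hypothetical names introduced here, and the column order is an assumption derived from how `get_dummies(..., drop_first=True)` orders its output for this dataset.

```python
import pandas as pd

# Assumed column order of x after one-hot encoding and dropping "Drug".
FEATURE_COLUMNS = ["Age", "Na_to_K", "Sex_M", "BP_LOW", "BP_NORMAL", "Cholesterol_NORMAL"]


def make_patient_row(age, na_to_k, sex, bp, cholesterol):
    """Hypothetical helper: encode one record the way the training data was encoded."""
    row = {
        "Age": age,
        "Na_to_K": na_to_k,
        "Sex_M": sex == "M",                            # baseline level: F
        "BP_LOW": bp == "LOW",                          # baseline level: HIGH
        "BP_NORMAL": bp == "NORMAL",
        "Cholesterol_NORMAL": cholesterol == "NORMAL",  # baseline level: HIGH
    }
    return pd.DataFrame([row], columns=FEATURE_COLUMNS)


# Same input as the positional list [18, 30, True, False, False, True]
patient = make_patient_row(age=18, na_to_k=30, sex="M", bp="HIGH", cholesterol="NORMAL")
print(patient)
```

With the notebook's fitted model in scope, `model.predict(patient)` would take the place of the positional call. Because the frame carries the feature names, scikit-learn can also check them against the names seen during fitting.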

### Accuracy Checking With Decision Tree

In [20]:
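# No random_state is set, so the fitted tree and its score can vary between runs
tree = DecisionTreeClassifier()
model = tree.fit(x_train, y_train)  # rebinds `model`, replacing the logistic regression
model.score(x_test, y_test)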

0.975
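
Accuracy alone does not show which drug classes a model confuses with one another. As a minimal sketch, the block below computes a confusion matrix and a per-class report with scikit-learn. The `y_true` and `y_pred` lists are made-up labels for illustration, not predictions from the notebook's models.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels for illustration; in the notebook these would be
# y_test and model.predict(x_test).
labels = ["DrugY", "drugA", "drugC", "drugX"]
y_true = ["DrugY", "DrugY", "drugA", "drugC", "drugX", "drugX"]
y_pred = ["DrugY", "DrugY", "drugA", "drugX", "drugX", "drugX"]

# Rows are true classes, columns are predicted classes, both in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Per-class precision, recall and F1; zero_division=0 avoids a warning for
# classes that are never predicted (drugC in this toy example).
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
```

Off-diagonal entries in the matrix point to specific confusions (here, one `drugC` record predicted as `drugX`) that a single accuracy figure hides.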

### Both logistic regression and a decision tree were fitted to the same split; of the two, the decision tree achieved the higher test accuracy (0.975 versus 0.95).
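
Both scores come from a single 40-row test split (20% of 200 rows), so 0.95 and 0.975 differ by one test sample. Cross-validation gives a steadier comparison. The sketch below is an illustration under stated assumptions: it uses synthetic data from `make_classification` as a stand-in for the notebook's `x` and `y`, adds feature scaling and a higher `max_iter` to the logistic regression, and fixes `random_state` for reproducibility. None of these choices come from the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the notebook's features and target (assumption):
# 200 rows, 6 features, 3 classes.
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=33,
)

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=33)

candidates = {
    # Scaling plus a larger max_iter helps the solver converge.
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "decision tree": DecisionTreeClassifier(random_state=33),
}

results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=cv, scoring="accuracy")
    results[name] = scores
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

In the notebook, `X` and `y` would be replaced by the encoded `x` and `y`. Comparing the mean scores together with their spread shows whether a gap between the models is larger than the fold-to-fold variation.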

# Summary
### Logistic regression and decision tree classifiers were applied to the drug dataset to assess their predictive performance. Logistic regression is a linear model valued for its interpretability; although it is often introduced for binary classification, here it is fitted to a target with several drug classes. Decision trees, in contrast, are non-linear models that can capture threshold effects and interactions between features. After training both models on the same 80/20 split, the decision tree reached a test accuracy of 0.975 against 0.95 for logistic regression, which suggests that it fits the structure of this data somewhat better. Logistic regression is simpler and computationally cheap, whereas decision trees tend to perform better when the relationships in the data are non-linear. The results underscore the importance of choosing a model based on the characteristics of the data.



# Conclusion
### The decision tree outperformed logistic regression in this analysis, with a test accuracy of 0.975 versus 0.95. Logistic regression remains valuable when the problem is close to linear and interpretability matters, while decision trees handle threshold-based and interacting patterns more effectively. This comparison emphasizes the need for data-driven model selection to achieve the best predictive outcomes.