Objective: Build a machine learning model that predicts a person's personality type (Extrovert or Introvert).

In [1]:
import pandas as pd
df=pd.read_csv(r'personality_dataset.csv')
df.head()

,Time_spent_Alone,Stage_fear,Social_event_attendance,Going_outside,Drained_after_socializing,Friends_circle_size,Post_frequency,Personality
0,4.0,No,4.0,6.0,No,13.0,5.0,Extrovert
1,9.0,Yes,0.0,0.0,Yes,0.0,3.0,Introvert
2,9.0,Yes,1.0,2.0,Yes,5.0,2.0,Introvert
3,0.0,No,6.0,7.0,No,14.0,8.0,Extrovert
4,3.0,No,9.0,4.0,No,8.0,5.0,Extrovert


In [2]:
df.shape

(2900, 8)

In [3]:
df.isnull().sum()

Time_spent_Alone             63
Stage_fear                   73
Social_event_attendance      62
Going_outside                66
Drained_after_socializing    52
Friends_circle_size          77
Post_frequency               65
Personality                   0
dtype: int64

In [4]:
df.dtypes

Time_spent_Alone             float64
Stage_fear                    object
Social_event_attendance      float64
Going_outside                float64
Drained_after_socializing     object
Friends_circle_size          float64
Post_frequency               float64
Personality                   object
dtype: object

In [5]:
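# For every column that has missing values, fill them with that column's mode (most frequent value)
df = df.apply(lambda x: x.fillna(x.mode()[0]) if x.isnull().any() else x)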
df

,Time_spent_Alone,Stage_fear,Social_event_attendance,Going_outside,Drained_after_socializing,Friends_circle_size,Post_frequency,Personality
0,4.0,No,4.0,6.0,No,13.0,5.0,Extrovert
1,9.0,Yes,0.0,0.0,Yes,0.0,3.0,Introvert
2,9.0,Yes,1.0,2.0,Yes,5.0,2.0,Introvert
3,0.0,No,6.0,7.0,No,14.0,8.0,Extrovert
4,3.0,No,9.0,4.0,No,8.0,5.0,Extrovert
...,...,...,...,...,...,...,...,...
2895,3.0,No,7.0,6.0,No,6.0,6.0,Extrovert
2896,3.0,No,8.0,3.0,No,14.0,9.0,Extrovert
2897,4.0,Yes,1.0,1.0,Yes,4.0,0.0,Introvert
2898,11.0,Yes,1.0,0.0,Yes,2.0,0.0,Introvert
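Cell In [5] fills every column, numeric or categorical, with its mode. One common alternative, which is an assumption here and not what this notebook does, is to use the median for numeric columns and keep the mode for categorical ones. A minimal sketch on a toy frame with made-up values (`toy` and `filled` are illustrative names, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Toy frame with illustrative values only; it is NOT the real dataset.
toy = pd.DataFrame({
    "Time_spent_Alone": [4.0, 9.0, np.nan, 0.0, 3.0],
    "Stage_fear": ["No", "Yes", "Yes", None, "Yes"],
})

# Separate numeric columns from everything else.
num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.select_dtypes(exclude="number").columns

filled = toy.copy()
# Numeric columns: fill gaps with the column median.
filled[num_cols] = filled[num_cols].fillna(filled[num_cols].median())
# Categorical columns: fill gaps with the most frequent value.
for col in cat_cols:
    filled[col] = filled[col].fillna(filled[col].mode()[0])

print(filled)
```

Whether the median or the mode suits a numeric column better depends on its distribution, so it is worth comparing both on the real data before settling on one.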


In [6]:
df.isnull().sum()

Time_spent_Alone             0
Stage_fear                   0
Social_event_attendance      0
Going_outside                0
Drained_after_socializing    0
Friends_circle_size          0
Post_frequency               0
Personality                  0
dtype: int64

In [7]:
df.drop(['Post_frequency'], axis=1, inplace=True)

In [8]:
df.columns

Index(['Time_spent_Alone', 'Stage_fear', 'Social_event_attendance',
       'Going_outside', 'Drained_after_socializing', 'Friends_circle_size',
       'Personality'],
      dtype='object')

In [9]:
df.describe()

,Time_spent_Alone,Social_event_attendance,Going_outside,Friends_circle_size
count,2900.0,2900.0,2900.0,2900.0
mean,4.407931,3.921379,2.931724,6.235172
std,3.503333,2.886616,2.266215,4.237255
min,0.0,0.0,0.0,0.0
25%,1.0,2.0,1.0,3.0
50%,3.0,3.0,3.0,5.0
75%,7.0,6.0,5.0,10.0
max,11.0,10.0,7.0,15.0


Encoding: One-Hot Encoding

In [10]:
!pip install category_encoders



In [11]:
import category_encoders as ce
# One-hot encode the two Yes/No columns; all other columns pass through unchanged
encoder = ce.OneHotEncoder(cols=['Stage_fear', 'Drained_after_socializing'])
df = encoder.fit_transform(df)
df.head(5)

,Time_spent_Alone,Stage_fear_1,Stage_fear_2,Social_event_attendance,Going_outside,Drained_after_socializing_1,Drained_after_socializing_2,Friends_circle_size,Personality
0,4.0,1,0,4.0,6.0,1,0,13.0,Extrovert
1,9.0,0,1,0.0,0.0,0,1,0.0,Introvert
2,9.0,0,1,1.0,2.0,0,1,5.0,Introvert
3,0.0,1,0,6.0,7.0,1,0,14.0,Extrovert
4,3.0,1,0,9.0,4.0,1,0,8.0,Extrovert
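In the output above, `Stage_fear_1` marks "No" and `Stage_fear_2` marks "Yes", so the two columns are exact complements of each other (the same holds for `Drained_after_socializing`). For a Yes/No feature, a single 0/1 column carries the same information. The sketch below shows that alternative with plain pandas on a toy frame; `binary_map` and `encoded` are illustrative names, and this is not the encoding the notebook uses.

```python
import pandas as pd

# Toy frame with illustrative values only; it is NOT the real dataset.
toy = pd.DataFrame({
    "Stage_fear": ["No", "Yes", "Yes", "No"],
    "Drained_after_socializing": ["No", "Yes", "Yes", "No"],
})

# Map each Yes/No column to a single 0/1 column instead of two dummy columns.
binary_map = {"No": 0, "Yes": 1}
encoded = toy.copy()
for col in ["Stage_fear", "Drained_after_socializing"]:
    encoded[col] = encoded[col].map(binary_map)

print(encoded)
```

With this layout the column name alone tells you what a 1 means, which makes later inspection of the features easier.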


Split the dataset into input features and target column

In [13]:
x = df.drop('Personality', axis=1)  # input features
y = df['Personality']               # target column
x.head(5)

,Time_spent_Alone,Stage_fear_1,Stage_fear_2,Social_event_attendance,Going_outside,Drained_after_socializing_1,Drained_after_socializing_2,Friends_circle_size
0,4.0,1,0,4.0,6.0,1,0,13.0
1,9.0,0,1,0.0,0.0,0,1,0.0
2,9.0,0,1,1.0,2.0,0,1,5.0
3,0.0,1,0,6.0,7.0,1,0,14.0
4,3.0,1,0,9.0,4.0,1,0,8.0


In [14]:
y.head(5)

0    Extrovert
1    Introvert
2    Introvert
3    Extrovert
4    Extrovert
Name: Personality, dtype: object

Split into training and test datasets

In [16]:
from sklearn.model_selection import train_test_split
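# Hold out 25% of the rows for testing; random_state fixes the shuffle so the split is reproducible
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)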

In [17]:
x_train.shape,x_test.shape,y_train.shape,y_test.shape

((2175, 8), (725, 8), (2175,), (725,))
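The split above is purely random, so the Extrovert/Introvert ratio in the test set can drift from the ratio in the full data. Passing `stratify=` to `train_test_split` keeps the class proportions the same in both parts. A small sketch on toy data (the frame, the 12/8 class counts, and the variable names are made up for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with illustrative values only: 12 Extroverts and 8 Introverts.
toy_x = pd.DataFrame({"Friends_circle_size": list(range(20))})
toy_y = pd.Series(["Extrovert"] * 12 + ["Introvert"] * 8, name="Personality")

# stratify=toy_y keeps the 60/40 class ratio in both the train and test parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    toy_x, toy_y, test_size=0.25, random_state=0, stratify=toy_y
)

print(y_tr.value_counts())
print(y_te.value_counts())
```

In the notebook this would amount to adding `stratify=y` to the existing call, assuming that preserving the class ratio is desired.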

In [18]:
x_test.head()

,Time_spent_Alone,Stage_fear_1,Stage_fear_2,Social_event_attendance,Going_outside,Drained_after_socializing_1,Drained_after_socializing_2,Friends_circle_size
582,1.0,1,0,5.0,5.0,1,0,8.0
1914,10.0,0,1,2.0,2.0,0,1,2.0
1074,1.0,1,0,7.0,7.0,1,0,14.0
1827,3.0,1,0,6.0,4.0,1,0,5.0
667,2.0,1,0,5.0,6.0,1,0,7.0


In [19]:
y_test.head()

582     Introvert
1914    Introvert
1074    Extrovert
1827    Extrovert
667     Extrovert
Name: Personality, dtype: object

Fit a logistic regression model on the training data

In [20]:
from sklearn.linear_model import LogisticRegression
model=LogisticRegression()
model.fit(x_train,y_train)
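A fitted logistic regression does more than output labels: it estimates a probability for each class, and `predict` returns the class with the higher probability. The sketch below illustrates this on a tiny one-feature toy set (hours spent alone, values made up). `max_iter=1000` is an illustrative choice to give the solver room to converge, not a setting taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy one-feature data with illustrative values only: hours spent alone per day.
X = np.array([[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]])
y = np.array(["Extrovert"] * 3 + ["Introvert"] * 3)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Columns of predict_proba follow the order of clf.classes_.
queries = np.array([[1.0], [9.0]])
proba = clf.predict_proba(queries)

print(clf.classes_)
print(proba)
print(clf.predict(queries))
```

Looking at the probabilities, rather than only the labels, shows how confident the model is about each prediction.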

Model Prediction

In [21]:
y_pred=model.predict(x_test)
y_pred[:5]

array(['Extrovert', 'Introvert', 'Extrovert', 'Extrovert', 'Extrovert'],
      dtype=object)

In [22]:
y_test[:5]

582     Introvert
1914    Introvert
1074    Extrovert
1827    Extrovert
667     Extrovert
Name: Personality, dtype: object

Model Evaluation

In [23]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
cm=confusion_matrix(y_test, y_pred)
cm

array([[326,  20],
       [ 42, 337]], dtype=int64)

In [24]:
accuracy_score(y_test, y_pred)*100

91.44827586206897

In [25]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

   Extrovert       0.89      0.94      0.91       346
   Introvert       0.94      0.89      0.92       379

    accuracy                           0.91       725
   macro avg       0.91      0.92      0.91       725
weighted avg       0.92      0.91      0.91       725
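The accuracy and the per-class precision and recall above follow directly from the confusion matrix. In scikit-learn's layout, rows are true labels and columns are predicted labels, in sorted label order (Extrovert, then Introvert), which matches the support column (346 and 379). The arithmetic can be checked by hand:

```python
# Confusion matrix from In [23]: rows = true labels, columns = predicted labels,
# both ordered as (Extrovert, Introvert).
cm = [[326, 20],
      [42, 337]]

total = sum(sum(row) for row in cm)

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy = (cm[0][0] + cm[1][1]) / total

# Extrovert precision: of everything predicted Extrovert, the share that is correct.
extrovert_precision = cm[0][0] / (cm[0][0] + cm[1][0])
# Extrovert recall: of all true Extroverts, the share that was found.
extrovert_recall = cm[0][0] / (cm[0][0] + cm[0][1])

# The same two quantities for the Introvert class.
introvert_precision = cm[1][1] / (cm[1][1] + cm[0][1])
introvert_recall = cm[1][1] / (cm[1][1] + cm[1][0])

print(total, round(accuracy * 100, 2))
print(round(extrovert_precision, 2), round(extrovert_recall, 2))
print(round(introvert_precision, 2), round(introvert_recall, 2))
```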



K-Nearest Neighbors (KNN)

In [26]:
from sklearn.neighbors import KNeighborsClassifier
model=KNeighborsClassifier(n_neighbors=5)
model.fit(x_train,y_train)
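KNN classifies by distance, so features with a larger range (such as `Friends_circle_size`, 0 to 15) weigh more than 0/1 columns unless the features are put on a common scale. The notebook fits KNN on unscaled features. The sketch below shows one possible variant, assuming standardization is wanted, using a scikit-learn pipeline on toy data (values, `n_neighbors=3`, and names are illustrative):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data with illustrative values only: [Friends_circle_size, stage-fear flag].
X = [[13, 0], [14, 0], [8, 0], [10, 0],
     [0, 1], [5, 1], [2, 1], [4, 1]]
y = ["Extrovert"] * 4 + ["Introvert"] * 4

# The scaler is fitted on the training data only and reused at prediction time.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn.fit(X, y)

print(knn.predict([[12, 0], [1, 1]]))
```

Wrapping the scaler and the classifier in one pipeline means the test data is transformed with the statistics learned from the training data, which avoids leaking test information into the fit.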

In [27]:
y_pred=model.predict(x_test)
y_pred[:5]

array(['Extrovert', 'Introvert', 'Extrovert', 'Extrovert', 'Extrovert'],
      dtype=object)

In [28]:
y_test[:5]

582     Introvert
1914    Introvert
1074    Extrovert
1827    Extrovert
667     Extrovert
Name: Personality, dtype: object

In [29]:
cm=confusion_matrix(y_test, y_pred)
cm

array([[326,  20],
       [ 33, 346]], dtype=int64)

In [30]:
accuracy_score(y_test, y_pred)*100

92.6896551724138

In [31]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

   Extrovert       0.91      0.94      0.92       346
   Introvert       0.95      0.91      0.93       379

    accuracy                           0.93       725
   macro avg       0.93      0.93      0.93       725
weighted avg       0.93      0.93      0.93       725
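The value `n_neighbors=5` used above is scikit-learn's default and was not tuned. One way to choose it, sketched here under the assumption that cross-validated accuracy is the selection criterion, is to score several candidate values and keep the best. The data below is synthetic (two Gaussian clusters), and `candidate_ks`, `mean_scores`, and `best_k` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-cluster data for illustration only; it is NOT the real dataset.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 2)),
    rng.normal(3.0, 1.0, size=(50, 2)),
])
y = np.array(["Extrovert"] * 50 + ["Introvert"] * 50)

# Mean 5-fold cross-validated accuracy for each candidate k.
candidate_ks = [1, 3, 5, 7, 9]
mean_scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    for k in candidate_ks
}
best_k = max(mean_scores, key=mean_scores.get)

print(mean_scores)
print("best k:", best_k)
```

In the notebook, the same loop would run on `x_train` and `y_train` only, so that the test set stays untouched until the final evaluation.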



Summary: On the 725-row test set, logistic regression reached 91.45% accuracy (663 correct predictions) and KNN with n_neighbors=5 reached 92.69% (672 correct predictions), a difference of 9 test rows on this single split.