#### Import Packages
---

In [1]:
import pandas as pd
import numpy as np
import altair as alt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

#### Q1:
---

In [2]:
s = pd.read_csv("social_media_usage.csv")
print(s.shape)

(1502, 89)


#### Q2:
---

In [3]:
def clean_sm(x):
    """Return 1 where x equals 1 and 0 otherwise (works element-wise)."""
    return np.where(x == 1, 1, 0)


toy_df = pd.DataFrame({'A':[1,0,1]})
toy_df['test'] = clean_sm(toy_df['A'])
print(toy_df)

   A  test
0  1     1
1  0     0
2  1     1


#### Q3:
---

In [4]:
ss = pd.DataFrame({
    # Target: 1 if the respondent uses LinkedIn, else 0
    "sm_li": clean_sm(s['web1h']),
    # Codes above the valid range are treated as missing
    "income": np.where(s['income'] > 9, np.nan, s['income']),
    "education": np.where(s['educ2'] > 8, np.nan, s['educ2']),
    # Binary indicators
    "parent": np.where(s['par'] == 1, 1, 0),
    "married": np.where(s['marital'] == 1, 1, 0),
    "female": np.where(s['gender'] == 2, 1, 0),
    "age": np.where(s['age'] > 98, np.nan, s['age'])
})

ss = ss.dropna()

# Mean income by age, colored by LinkedIn use
(
    alt.Chart(ss.groupby(["age", "sm_li"], as_index=False)["income"].mean())
    .mark_circle()
    .encode(x="age", y="income", color="sm_li:N")
)

#### Q4:
---

In [5]:
y = ss["sm_li"]
X = ss[["income", "education", "parent", "married", "female", "age"]]

#### Q5:
---

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X.values, y.values, stratify=y, test_size=0.2, random_state=952)

X_train holds 80% of the feature rows and is used to fit the model. \
X_test holds the remaining 20% of the feature rows, kept unseen during training so that performance can be evaluated on new data.

y_train holds the target values for the same 80% of rows; these are what the model learns to predict. \
y_test holds the target values for the 20% held-out rows and is compared against the model's predictions. Because `stratify=y` is set, both splits keep roughly the same share of LinkedIn users.
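
To make the effect of `stratify=y` concrete, here is a minimal hand-rolled sketch of a stratified 80/20 split. It is an illustration only, not scikit-learn's implementation: the helper `stratified_split` and the toy target `toy_y` are hypothetical and use made-up data.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with similar class shares in both parts."""
    rng = random.Random(seed)

    # Group row indices by class label
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy target (made up): 70 non-users (0) and 30 users (1)
toy_y = [0] * 70 + [1] * 30
train_idx, test_idx = stratified_split(toy_y)

print(len(train_idx), len(test_idx))      # 80 20
print(sum(toy_y[i] for i in train_idx))   # 24 users in train (30%)
print(sum(toy_y[i] for i in test_idx))    # 6 users in test (30%)
```

Both parts end up with the same 30% share of users as the full toy target, which is the property the stratified split in the cell above is asking for.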

#### Q6:
---

In [7]:
lr = LogisticRegression(class_weight = 'balanced')
lr.fit(X_train, y_train)


#### Q7:
---

In [8]:
y_pred = lr.predict(X_test)
print(confusion_matrix(y_test, y_pred))

[[115  53]
 [ 21  63]]


In the confusion matrix, rows are the actual classes and columns are the predicted classes. The top-left cell holds the true negatives: the model predicted that a person does not use LinkedIn and was correct. There were 115 such predictions.

The bottom-right cell holds the true positives: the model correctly predicted that a person uses LinkedIn. There were 63 such predictions.

The bottom-left cell holds the false negatives: the model predicted that a person does not use LinkedIn, but they actually do. There were 21 such errors.

The top-right cell holds the false positives: the model predicted that a person uses LinkedIn, but they do not. There were 53 such errors.
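
The four cells described above can be unpacked by name. This is a small sketch that copies the counts from the printed matrix into a plain nested list (the variable names are my own) and derives accuracy from them.

```python
# Counts copied from the confusion matrix printed above:
# rows = actual class, columns = predicted class
cm = [[115, 53],
      [21, 63]]

# Row 0 is "actual negative", row 1 is "actual positive"
(tn, fp), (fn, tp) = cm

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # TN=115 FP=53 FN=21 TP=63
print(f"Total: {total}")                   # Total: 252
print(f"Accuracy: {accuracy:.2f}")         # Accuracy: 0.71
```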

#### Q8:
---

In [9]:
pd.DataFrame(confusion_matrix(y_test, y_pred),
    columns=["Predicted negative", "Predicted positive"], index=["Actual negative","Actual positive"])

                 Predicted negative  Predicted positive
Actual negative                 115                  53
Actual positive                  21                  63


#### Q9:
---

In [10]:
precision = (63)/(63+53)
print(f"Precision: {precision}")
recall = (63)/(63+21)
print(f"Recall: {recall}")
f1score = 2 * ((precision*recall)/(precision+recall))
print(f"F1 Score: {f1score}")

Precision: 0.5431034482758621
Recall: 0.75
F1 Score: 0.63


In [11]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.85      0.68      0.76       168
           1       0.54      0.75      0.63        84

    accuracy                           0.71       252
   macro avg       0.69      0.72      0.69       252
weighted avg       0.74      0.71      0.71       252



Precision measures the proportion of true positives among all positive predictions. It matters most when false positives are costly, for example a spam filter that sends legitimate email to the junk folder.

Recall measures the proportion of true positives among all actual positive instances. It matters most when missed cases are costly, such as a hospital screening that leaves a patient untreated or a fraud check that lets a fraudster go undetected.

The F1 score is the harmonic mean of precision and recall. It is useful when you want a single number that balances the two, particularly when the class distribution is uneven and accuracy alone can be misleading.
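
As a cross-check on the definitions above, the sketch below recomputes all three metrics for both classes from the confusion matrix counts. The helper `precision_recall_f1` is a hypothetical name; for class 0 the roles are swapped, so "not a LinkedIn user" is treated as the positive class.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from the confusion matrix [[115, 53], [21, 63]]
counts = {
    0: {"tp": 115, "fp": 21, "fn": 53},  # class 0 treated as "positive"
    1: {"tp": 63, "fp": 53, "fn": 21},   # class 1 (LinkedIn user)
}

results = {}
for label, c in counts.items():
    results[label] = precision_recall_f1(**c)
    p, r, f = results[label]
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")

# class 0: precision=0.85 recall=0.68 f1=0.76
# class 1: precision=0.54 recall=0.75 f1=0.63
```

The rounded values agree with the per-class rows of the classification report.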

#### Q10:
---

In [12]:
# Person 1: high income, highly educated, married, female, not a parent, 42 years old
person1 = [8, 7, 0, 1, 1, 42]

print("Person 1: ")
print(lr.predict([person1]))
print(lr.predict_proba([person1]))

# Person 2: high income, highly educated, married, female, not a parent, 82 years old
person2 = [8, 7, 0, 1, 1, 82]

print("\nPerson 2: ")
print(lr.predict([person2]))
print(lr.predict_proba([person2]))

# Person 3: high income, fairly educated, unmarried, male, not a parent, 23 years old
person3 = [8, 6, 0, 0, 0, 23]
print("\nPerson 3: ")
print(lr.predict([person3]))
print(lr.predict_proba([person3]))

# Person 4 (me): lower income, fairly educated, unmarried, male, not a parent, 21 years old
person4 = [3, 6, 0, 0, 0, 21]
print("\nPerson 4: ")
print(lr.predict([person4]))
print(lr.predict_proba([person4]))

# Person 5 (grandpa): married, male, parent, 83 years old
person5 = [5, 6, 1, 1, 0, 83]
print("\nPerson 5: ")
print(lr.predict([person5]))
print(lr.predict_proba([person5]))

Person 1: 
[1]
[[0.30006512 0.69993488]]

Person 2: 
[0]
[[0.52931 0.47069]]

Person 3: 
[1]
[[0.19628018 0.80371982]]

Person 4: 
[1]
[[0.47690719 0.52309281]]

Person 5: 
[0]
[[0.68204808 0.31795192]]
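
The predicted labels above follow from the probabilities: for a binary model like this one, `predict` returns class 1 when the class 1 probability exceeds 0.5. The sketch below reproduces the labels from the printed probabilities; the helper `label_from_proba` and the stricter 0.6 cutoff are hypothetical and only illustrate how a different threshold would change the decisions.

```python
# Probabilities copied from the predict_proba output above:
# [P(not a LinkedIn user), P(LinkedIn user)]
probas = {
    "Person 1": [0.30006512, 0.69993488],
    "Person 2": [0.52931, 0.47069],
    "Person 3": [0.19628018, 0.80371982],
    "Person 4": [0.47690719, 0.52309281],
    "Person 5": [0.68204808, 0.31795192],
}


def label_from_proba(p, threshold=0.5):
    """Return 1 if the class 1 probability exceeds the threshold, else 0."""
    return int(p[1] > threshold)


default_labels = [label_from_proba(p) for p in probas.values()]
strict_labels = [label_from_proba(p, threshold=0.6) for p in probas.values()]

print(default_labels)  # [1, 0, 1, 1, 0]  matches the predict output
print(strict_labels)   # [1, 0, 1, 0, 0]  Person 4 flips at the 0.6 cutoff
```

Person 4 sits close to the decision boundary (about 0.52), so that prediction is the least certain of the five.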
