
In [1]:
import numpy as np
import pandas as pd

In [2]:
np.random.seed(42)  # fix the random seed so the generated values are the same on every run

In [3]:
n_users = 5000

In [4]:
session_time = np.random.exponential(scale=240, size=n_users)  # session time in seconds, mean 240 s (4 minutes)

In [5]:
pages_viewed = np.random.poisson(lam=3, size=n_users)  # pages viewed per session, mean 3

In [6]:
past_visits = np.random.poisson(lam=1, size=n_users)  # number of past visits, mean 1

In [7]:
traffic_source = np.random.choice(['direct', 'referral', 'ads'], size = n_users, p=[0.4, 0.3, 0.3]) # where the user came from


In [8]:
user_device = np.random.choice(['Mobile', 'Desktop'], size=n_users, p=[0.6, 0.4])  # device the user visited from

In [9]:
# additive effect of each traffic source and device on the conversion log-odds
source_effects = {'direct': 0, 'referral': 0.2, 'ads': -1}
device_effects = {'Mobile': 0, 'Desktop': 0.3}

In [10]:
# compute each user's score (the log-odds of conversion)
score = (
    -2 + 0.001*session_time + 0.5*pages_viewed + 0.3*past_visits
    + np.array([source_effects[s] for s in traffic_source])
    + np.array([device_effects[d] for d in user_device])
)

In [11]:
# turn the score into a probability with the sigmoid function
chances = 1/(1+np.exp(-score))
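
To make the scale of the score concrete, the sketch below works through the formula by hand for one hypothetical "average" user (240 s session, 3 pages, 1 past visit, direct traffic, mobile). The helper name `conversion_probability` and the chosen user are illustrative assumptions; the coefficients are copied from the cells above.

```python
import math

# Coefficients copied from the score formula above.
SOURCE_EFFECTS = {'direct': 0, 'referral': 0.2, 'ads': -1}
DEVICE_EFFECTS = {'Mobile': 0, 'Desktop': 0.3}


def conversion_probability(session_time, pages_viewed, past_visits, source, device):
    """Return the conversion probability for one user (hypothetical helper)."""
    score = (
        -2 + 0.001 * session_time + 0.5 * pages_viewed + 0.3 * past_visits
        + SOURCE_EFFECTS[source] + DEVICE_EFFECTS[device]
    )
    return 1 / (1 + math.exp(-score))


# An "average" user: score = -2 + 0.24 + 1.5 + 0.3 = 0.04, just above zero.
p_average = conversion_probability(240, 3, 1, 'direct', 'Mobile')
# The same user arriving through ads: the score drops by 1 to -0.96.
p_ads = conversion_probability(240, 3, 1, 'ads', 'Mobile')
print(round(p_average, 3), round(p_ads, 3))
```

A score near zero gives a probability near 0.5, which is consistent with the roughly balanced classes seen in the classification report.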

In [12]:
# decide whether each user converts (1) or not (0)
did_convert = (np.random.random(n_users) < chances).astype(int)  # uniform random draw compared with the probability
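
The comparison `np.random.random(n) < p` is a Bernoulli draw: a uniform number in [0, 1) falls below `p` with probability `p`. A small sketch with a fixed, made-up probability of 0.3 (an assumption for illustration, not a value from the dataset) shows the share of ones approaching that probability:

```python
import numpy as np

rng = np.random.default_rng(0)  # separate generator, so the notebook's global seed is untouched
p = 0.3                         # illustrative probability, not taken from the dataset
n_draws = 100_000

# Each uniform draw in [0, 1) is below p with probability p.
draws = (rng.random(n_draws) < p).astype(int)
observed_rate = draws.mean()
print(observed_rate)
```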


In [13]:
user_data = pd.DataFrame({
    'session_time': session_time,
    'pages_viewed': pages_viewed,
    'past_visits': past_visits,
    'traffic_source': traffic_source,
    'user_device': user_device,
    'did_convert': did_convert
})

In [14]:
user_data.head()

Unnamed: 0,session_time,pages_viewed,past_visits,traffic_source,user_device,did_convert
0,112.624342,4,2,direct,Mobile,0
1,722.429143,3,2,referral,Mobile,0
2,316.018966,3,1,direct,Desktop,1
3,219.106213,2,0,referral,Desktop,0
4,40.709969,0,1,direct,Mobile,0


In [15]:
user_data.tail()

Unnamed: 0,session_time,pages_viewed,past_visits,traffic_source,user_device,did_convert
4995,546.45389,9,0,direct,Mobile,1
4996,30.511253,4,1,referral,Mobile,1
4997,95.343142,2,0,direct,Desktop,0
4998,405.944802,5,1,direct,Mobile,1
4999,218.302596,0,1,ads,Desktop,0


In [16]:
user_data.shape

(5000, 6)

In [17]:
user_data.to_csv('user_data.csv',index=False)

In [18]:
user_data.isnull().sum()

session_time,0
pages_viewed,0
past_visits,0
traffic_source,0
user_device,0
did_convert,0


In [19]:
user_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   session_time    5000 non-null   float64
 1   pages_viewed    5000 non-null   int64  
 2   past_visits     5000 non-null   int64  
 3   traffic_source  5000 non-null   object 
 4   user_device     5000 non-null   object 
 5   did_convert     5000 non-null   int64  
dtypes: float64(1), int64(3), object(2)
memory usage: 234.5+ KB


We now have a dataset. It is synthetic, because real data on these factors is hard to obtain.

**Next, we work with this dataset.**

In [20]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

In [23]:
# Preprocess the data
# Convert the categorical columns traffic_source and user_device into numbers using one-hot encoding
encoder = OneHotEncoder(sparse_output=False, drop='first')  # drop the first category of each column to avoid redundant columns
encoded_features = encoder.fit_transform(user_data[['traffic_source', 'user_device']])
encoded_columns = encoder.get_feature_names_out(['traffic_source', 'user_device'])

In [24]:
encoded_data = pd.DataFrame(encoded_features, columns = encoded_columns)

In [25]:
other_features = user_data[['session_time', 'pages_viewed', 'past_visits']]
all_features = pd.concat([other_features , encoded_data], axis =1)

In [26]:
target = user_data['did_convert']

In [27]:
all_features.head()

Unnamed: 0,session_time,pages_viewed,past_visits,traffic_source_direct,traffic_source_referral,user_device_Mobile
0,112.624342,4,2,1.0,0.0,1.0
1,722.429143,3,2,0.0,1.0,1.0
2,316.018966,3,1,1.0,0.0,0.0
3,219.106213,2,0,0.0,1.0,0.0
4,40.709969,0,1,1.0,0.0,1.0


In [28]:
all_features.shape

(5000, 6)

In [29]:
X_train, X_test, y_train, y_test = train_test_split(all_features, target , test_size = 0.2, random_state = 42)

In [30]:
X_train.shape

(4000, 6)

In [31]:
X_test.shape

(1000, 6)

In [32]:
# build the model: logistic regression
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
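
Because the data was generated from a known log-odds formula, the fitted coefficients can be compared against the values they should approximate. The sketch below only does the arithmetic; it assumes the encoder dropped the alphabetically first category of each column (`ads` and `Desktop`), so each categorical coefficient is a difference from that reference. The variable names are illustrative, and default regularization plus sampling noise mean `model.coef_` and `model.intercept_` will only be roughly close to these numbers.

```python
# Effects copied from the data-generation cells.
source_effects = {'direct': 0, 'referral': 0.2, 'ads': -1}
device_effects = {'Mobile': 0, 'Desktop': 0.3}

# Reference (dropped) categories: the alphabetically first of each column.
source_ref = sorted(source_effects)[0]   # 'ads'
device_ref = sorted(device_effects)[0]   # 'Desktop'

expected = {'session_time': 0.001, 'pages_viewed': 0.5, 'past_visits': 0.3}
for name, effect in source_effects.items():
    if name != source_ref:
        expected[f'traffic_source_{name}'] = effect - source_effects[source_ref]
for name, effect in device_effects.items():
    if name != device_ref:
        expected[f'user_device_{name}'] = effect - device_effects[device_ref]

# The intercept absorbs the effects of the reference categories.
expected_intercept = -2 + source_effects[source_ref] + device_effects[device_ref]

for name, value in expected.items():
    print(f'{name}: {value:.3f}')
print(f'intercept: {expected_intercept:.3f}')
```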

In [33]:
y_pred = model.predict(X_test)

In [34]:
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)


In [35]:
print('\nModel accuracy', accuracy)


Model accuracy 0.706


In [36]:
print('\nReport', report)  # precision, recall, F1-score and support per class


Report               precision    recall  f1-score   support

           0       0.70      0.73      0.71       506
           1       0.71      0.68      0.70       494

    accuracy                           0.71      1000
   macro avg       0.71      0.71      0.71      1000
weighted avg       0.71      0.71      0.71      1000
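
To put the accuracy in context, it helps to compare it with a majority-class baseline, that is, always predicting the more frequent class. The sketch below uses only the support counts and the accuracy printed above; the variable names are illustrative.

```python
# Test-set class counts (the "support" column) and accuracy, copied from the output above.
support = {0: 506, 1: 494}
model_accuracy = 0.706

total = sum(support.values())
# Always predicting the most frequent class gets exactly that class's share right.
baseline_accuracy = max(support.values()) / total
improvement = model_accuracy - baseline_accuracy
print(f'baseline: {baseline_accuracy:.3f}, model: {model_accuracy:.3f}, gain: {improvement:.3f}')
```

With nearly balanced classes the baseline sits close to 0.5, so the model improves on it by about 20 percentage points.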



In [37]:
import pickle

In [38]:
# save the trained model
with open('model.pkl', 'wb') as file:
    pickle.dump(model, file)

In [39]:
with open('encoder.pkl', 'wb') as file:
    pickle.dump(encoder, file)