
#### Importing required packages


1.   pandas for dataframe wrangling
2.   json for saving and loading output dict checkpoints, so that results are not lost if the Colab session resets
3.   datetime for timestamping output files

In [2]:
import pandas as pd
import json
from datetime import datetime

In [None]:
user_file = '/content/drive/My Drive/data/Actual User Event.csv'
sensor_file = '/content/drive/My Drive/data/Sensor Data.csv'

In [3]:
cd /content/drive/My Drive/data/

/content/drive/My Drive/data


### Data prep steps:


*   Parse the date column to datetime format
*   Separate date and time for easy comparison between the user file and the sensor file
*   Rename columns for easy handling





User activity data load and prep

In [11]:
df_user = pd.read_csv('Actual User Event.csv', parse_dates=[1])
df_user.columns = ['Activity', 'Time', 'App_Dev']
df_user.Activity = df_user.Activity.str.lower()
df_user.sort_values(by = ['Time'], inplace=True)
df_user.head()

Unnamed: 0,Activity,Time,App_Dev
2624,reply to tweet,2020-03-12 04:35:44+05:30,Tweetbot for iΟS
2909,retweet,2020-03-12 04:36:20+05:30,Tweetbot for iΟS
2908,reply to tweet,2020-03-12 04:37:51+05:30,Tweetbot for iΟS
1755,tweet,2020-03-12 04:38:35+05:30,Tweetbot for iΟS
1754,retweet,2020-03-12 04:52:51+05:30,Tweetbot for iΟS


Sensor data load and parsing

In [None]:
df_sensor = pd.read_csv('Sensor Data.csv', parse_dates=[1])
df_sensor.head()

Unnamed: 0,Activity Type,Time,User ID
0,Tweet With Text,2020-03-13 20:53:00+05:30,889584149
1,Tweet With Text,2020-03-13 20:54:00+05:30,889584149
2,Tweet With Text,2020-03-13 20:55:00+05:30,889584149
3,Tweet With Text,2020-03-13 20:56:00+05:30,889584149
4,Tweet With Text,2020-03-13 21:00:00+05:30,889584149


In [None]:
df_sensor.columns = ['Activity', 'Time', 'User']
df_sensor.sort_values(by = ['User', 'Time'], inplace=True)
df_sensor.Activity = df_sensor.Activity.str.lower()
df_sensor.dtypes

Activity                                   object
Time        datetime64[ns, pytz.FixedOffset(330)]
User                                        int64
dtype: object

Activity type analysis:
No direct mapping between user and sensor activity types had been provided at the time of this analysis. The following assumptions are made:


*   Tweet with GIF is assumed to be Tweet with Image
*   Tweet is assumed to be Tweet with Text
*   Retweet and Reply to tweet have no counterpart in the sensor data, so they are counted as an activity match whenever date and time match, because the sensors are inaccurate
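
The assumptions above can also be written as a single lookup table. The sketch below is one possible way to do this; `ACTIVITY_MAP` and `map_activity` are hypothetical names introduced here, and the demo labels are toy values. Labels that are not in the table (retweet, reply to tweet, tweet with video) pass through unchanged.

```python
import pandas as pd

# Assumed mapping, taken from the bullet points above.
ACTIVITY_MAP = {
    'tweet': 'tweet with text',
    'tweet with gif': 'tweet with image',
}

def map_activity(activity: pd.Series) -> pd.Series:
    """Lower-case the labels, then replace whole values found in ACTIVITY_MAP."""
    return activity.str.lower().replace(ACTIVITY_MAP)

# Toy labels for illustration only.
demo = pd.Series(['Tweet', 'Tweet With GIF', 'Retweet', 'Tweet With Video'])
mapped = map_activity(demo).tolist()
print(mapped)
```

Using one table for both files works here only because the user file has no GIF label and the sensor file has no plain `tweet` label.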





In [None]:
df_user.Activity.unique(), df_sensor['Activity'].unique()

(array(['reply to tweet', 'retweet', 'tweet', 'tweet with video',
        'tweet with image'], dtype=object),
 array(['tweet with text', 'tweet with image', 'tweet with gif',
        'tweet with video'], dtype=object))

In [None]:
df_user['activity_mapped'] = df_user.Activity
df_user.loc[df_user.activity_mapped == 'tweet', 'activity_mapped'] = 'tweet with text' 
df_sensor['activity_mapped'] = df_sensor.Activity
df_sensor.loc[df_sensor.activity_mapped == 'tweet with gif', 'activity_mapped'] = 'tweet with image' 
user_activity = df_user.activity_mapped.unique().tolist()
sensor_activity = df_sensor.activity_mapped.unique().tolist()

Function to create separate date and time columns

In [None]:
def split_datetime(df):
    '''Use the datetime column `Time` of the dataframe to create `date` and `time` columns.'''
    df['date'] = df['Time'].dt.date
    df['time'] = df['Time'].dt.time
    return df

In [None]:
df_sensor = split_datetime(df_sensor)
df_user = split_datetime(df_user)

In [None]:
df_sensor.head()

Unnamed: 0,Activity,Time,User,date,time
371431,tweet with text,2020-03-12 02:39:00+05:30,500389515,2020-03-12,02:39:00
340293,tweet with text,2020-03-12 02:40:00+05:30,500389515,2020-03-12,02:40:00
340294,tweet with text,2020-03-12 02:41:00+05:30,500389515,2020-03-12,02:41:00
340295,tweet with text,2020-03-12 02:42:00+05:30,500389515,2020-03-12,02:42:00
371432,tweet with text,2020-03-12 02:49:00+05:30,500389515,2020-03-12,02:49:00


## Discussion on approach:
#### Given facts:
* Sensor accuracy is between 5% and 80%.
* The false positive rate of event detection is 300%.
* The observation periods of the sensor data and the user data might not fully overlap.
* Activity types in the user file and the sensor file do not overlap completely.

Therefore a partial match approach is suitable.

#### Assumptions
* Sensors are able to identify dates correctly
* No information is available on how accurately the sensors record time, so it is assumed that time is captured correctly. A delay or error (+/-) in the recorded time can easily be accommodated in the approach below
* The facts above rule out matching on the number of events
* The user file is assumed to be correct and accurate
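
As a rough illustration of how a time error could be accommodated, the sketch below uses `pd.merge_asof` to attach the nearest sensor event within a tolerance to each user event. This is not the matching used in this notebook, which requires exact equality. The function name `match_with_tolerance`, the one-minute tolerance, and the demo timestamps are all assumptions made for the example.

```python
import pandas as pd

def match_with_tolerance(user, sensor, tolerance='1min'):
    """Attach the nearest sensor event within `tolerance` to each user event."""
    # merge_asof requires both frames to be sorted by their key column.
    user = user.sort_values('Time')
    sensor = sensor.sort_values('Time').rename(
        columns={'Time': 'sensor_time', 'activity_mapped': 'sensor_activity'})
    return pd.merge_asof(
        user, sensor,
        left_on='Time', right_on='sensor_time',
        direction='nearest', tolerance=pd.Timedelta(tolerance))

# Toy data for illustration only.
user = pd.DataFrame({
    'Time': pd.to_datetime(['2020-03-12 10:00:20', '2020-03-12 10:05:00']),
    'activity_mapped': ['tweet with text', 'retweet'],
})
sensor = pd.DataFrame({
    'Time': pd.to_datetime(['2020-03-12 10:00:00', '2020-03-12 10:10:00']),
    'activity_mapped': ['tweet with text', 'tweet with image'],
})

matched = match_with_tolerance(user, sensor)
# User events with no sensor event inside the tolerance get NaT in sensor_time.
n_matched = int(matched['sensor_time'].notna().sum())
print(n_matched)
```

The first toy event is 20 seconds from a sensor reading and is matched; the second is five minutes from either reading and is left unmatched.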

#### Matching approach:
* Activity Type, Date and Time are the relevant features and will be used to match the logs
* Retweet and Reply to tweet are ignored when matching on activity type

The function below implements the matching algorithm and records the observed match counts:
* full_match: Date + Time + Activity match
* date_time_match: Date + Time match

In [None]:
def match_activity(df_user, df_sensor_filter, userid):
    '''
    Count user events that match the sensor log of a single sensor user.
    1. Match date
    2. Match time
    3. Match activity
    4. If the activity is retweet or reply, ignore the activity match
    5. Optimize search to reduce time
    Both frames must be sorted by time; df_sensor_filter must already be
    filtered to `userid` and have a 0..n-1 index (see the calling loop).
    '''
    end_index = df_sensor_filter.shape[0]
    match_index = 0  # sensor row of the latest match; earlier rows are skipped
    date_time_match = 0
    full_match = 0
    for _, act in df_user.iterrows():
        for idx in range(match_index, end_index):
            if act['date'] == df_sensor_filter.loc[idx, 'date']:
                if act['time'] == df_sensor_filter.loc[idx, 'time']:
                    date_time_match += 1
                    match_index = idx
                    if act['activity_mapped'] in ['retweet', 'reply to tweet']:
                        # No sensor counterpart: date + time is counted as a full match
                        full_match += 1
                    elif act['activity_mapped'] == df_sensor_filter.loc[idx, 'activity_mapped']:
                        full_match += 1
                    break  # stop at the first sensor row with the same date and time
    return date_time_match, full_match
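
The nested loop compares every user event against a run of sensor rows, which is slow for large logs. A possible vectorized alternative, sketched here under the assumption that equal `date` and `time` is the same as equal `Time`, is an inner merge on `Time`. The name `count_matches` and the demo data are hypothetical. Note one difference: when several sensor rows share a timestamp, this sketch counts a full match if any of them has the same activity, whereas the loop only checks the first one.

```python
import pandas as pd

IGNORED = {'retweet', 'reply to tweet'}  # activity is not compared for these

def count_matches(user, sensor_one_user):
    """Return (date_time_match, full_match) for one sensor user."""
    events = user[['Time', 'activity_mapped']].reset_index(drop=True)
    events['event'] = events.index  # one id per user event
    # Inner merge keeps only user events that share a timestamp with a sensor row.
    pairs = events.merge(sensor_one_user[['Time', 'activity_mapped']],
                         on='Time', suffixes=('', '_sensor'))
    same_activity = pairs['activity_mapped'] == pairs['activity_mapped_sensor']
    ignored = pairs['activity_mapped'].isin(IGNORED)
    date_time_match = pairs['event'].nunique()
    full_match = pairs.loc[same_activity | ignored, 'event'].nunique()
    return date_time_match, full_match

# Toy data for illustration only.
times = pd.to_datetime(['2020-03-12 10:00', '2020-03-12 10:01',
                        '2020-03-12 10:02', '2020-03-12 10:03'])
user = pd.DataFrame({
    'Time': times[:3],
    'activity_mapped': ['tweet with text', 'retweet', 'tweet with image'],
})
sensor = pd.DataFrame({
    'Time': [times[0], times[1], times[3]],
    'activity_mapped': ['tweet with image', 'tweet with text', 'tweet with video'],
})

result = count_matches(user, sensor)
print(result)
```

In the toy data the first event matches on time only, the retweet matches on time and is counted as a full match, and the third event has no sensor row at its timestamp.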

Function to write output checkpoints as JSON to Google Drive

In [None]:
def writefile(match_dict):
    '''Write an output checkpoint as JSON to the current working directory on Google Drive.'''
    now = datetime.now()
    # The name carries only the time of day, so each call creates a new file
    filename = 'dict{}'.format(now.strftime("%H:%M:%S"))
    with open(filename, 'w') as file:
        file.write(json.dumps(match_dict))

In [None]:
userids = df_sensor.User.unique().tolist() ## All userids in sensor file
#userids = userids[:10]

### Main search loop
Before searching, drop the user events that fall outside the date range covered by each sensor user's data

In [None]:
match_dict = {}
for i, userid in enumerate(userids):
    df_sensor_filter = df_sensor[df_sensor.User == userid].reset_index(drop=True)
    mind, maxd = df_sensor_filter.date.min(), df_sensor_filter.date.max()
    dfu = df_user[(df_user.date >= mind) & (df_user.date <= maxd)]
    print('Processing for {}'.format(userid))
    item_dict = {}
    item_dict['date_time_match'], item_dict['full_match'] = match_activity(dfu, df_sensor_filter, userid)
    match_dict[userid] = item_dict
    print(userid, match_dict[userid])
    if i % 5 == 0:
        writefile(match_dict)
writefile(match_dict)

Processing for 831025230667
831025230667 {'date_time_match': 5, 'full_match': 4}
Processing for 831025304854
831025304854 {'date_time_match': 3, 'full_match': 1}
Processing for 831040073720
831040073720 {'date_time_match': 5, 'full_match': 3}
Processing for 831040132270
831040132270 {'date_time_match': 6, 'full_match': 4}
Processing for 831040188914
831040188914 {'date_time_match': 8, 'full_match': 7}
Processing for 831040280877
831040280877 {'date_time_match': 3, 'full_match': 3}
Processing for 831040408395


In [None]:
len(userids) 

98

Re-run the search for the users that were left out because of a session reset

In [None]:
user_left = [itr for itr in userids if itr not in user_done]
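
`user_done` is not defined in the cells shown here. One way it could be rebuilt, assuming the last checkpoint is a JSON object keyed by user id, is sketched below. The helper `users_done`, the file name `dict_demo.json` and the ids in the demo are hypothetical.

```python
import json
import os
import tempfile

def users_done(checkpoint_path):
    """Return the set of user ids stored in a JSON checkpoint."""
    with open(checkpoint_path) as file:
        saved = json.load(file)
    # JSON object keys are always strings, so convert them back to int.
    return {int(userid) for userid in saved}

# Demo with a throwaway checkpoint; ids and counts are made up.
path = os.path.join(tempfile.mkdtemp(), 'dict_demo.json')
with open(path, 'w') as file:
    file.write(json.dumps({101: {'date_time_match': 5, 'full_match': 4}}))

user_done = users_done(path)
userids = [101, 102, 103]
user_left = [itr for itr in userids if itr not in user_done]
print(user_left)
```

The conversion back to `int` matters: without it, no id from `userids` would be found among the string keys and every user would be processed again.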

Generate composite score:


*   Composite score = 0.75 * full match + 0.25 * date time match
*   Scale each column by its sum so that every score is below one
*   Sort by composite score in descending order
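
Written out, with $d_u$ and $f_u$ as the raw `date_time_match` and `full_match` counts of user $u$ (symbols introduced here for illustration), and assuming the composite is computed from the raw counts before any scaling, the steps amount to:

$$
c_u = 0.25\, d_u + 0.75\, f_u, \qquad
\tilde{c}_u = \frac{c_u}{\sum_v c_v}, \qquad
\tilde{d}_u = \frac{d_u}{\sum_v d_v}, \qquad
\tilde{f}_u = \frac{f_u}{\sum_v f_v}
$$

Because each column is divided by its own sum, $\tilde{c}_u$ is in general not equal to $0.25\,\tilde{d}_u + 0.75\,\tilde{f}_u$. The scaled values are useful for ranking users, not as probabilities of a match.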



In [12]:
results_df = pd.read_csv('/content/drive/My Drive/data/final_list.csv')
results_df['composite_match'] = results_df.date_time_match * 0.25 + results_df.full_match * 0.75
results_df.set_index(['User'], inplace=True)
results_df = results_df / results_df.sum()
results_df.sort_values(['composite_match'], ascending=False, inplace=True)
results_df.head()

User,date_time_match,full_match,composite_match
555802093,0.042751,0.04712,0.045724
546629996,0.035316,0.034031,0.034442
548932008,0.031599,0.034031,0.033254
547792325,0.02974,0.031414,0.030879
889969367,0.024164,0.02356,0.023753


In [10]:
results_df.to_csv('scenario_2_output.csv')