# Data Sets

The data is contained in three files:

* portfolio.json - offer ids and metadata about each offer (duration, type, etc.)
* profile.json - demographic data for each customer
* transcript.json - records for transactions, offers received, offers viewed, and offers completed

Here is the schema and explanation of each variable in the files:

**portfolio.json**
* id (string) - offer id
* offer_type (string) - type of offer, i.e. BOGO, discount, informational
* difficulty (int) - minimum required spend to complete an offer
* reward (int) - reward given for completing an offer
* duration (int) - time for offer to be open, in days
* channels (list of strings) - channels through which the offer is available (e.g. email)

**profile.json**
* age (int) - age of the customer 
* became_member_on (int) - date when customer created an app account
* gender (str) - gender of the customer (note some entries contain 'O' for other rather than M or F)
* id (str) - customer id
* income (float) - customer's income

**transcript.json**
* event (str) - record description (i.e. transaction, offer received, offer viewed, etc.)
* person (str) - customer id
* time (int) - time in hours since start of test. The data begins at time t=0
* value (dict of strings) - either an offer id or a transaction amount, depending on the record
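
Note that `duration` is expressed in days while `time` is expressed in hours, so the two need a common unit before they can be compared. The following is a minimal sketch of that conversion on toy data; the frames `received_demo` and `portfolio_units_demo`, their values, and the `expires_at` column are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Toy stand-ins (hypothetical values), not the real portfolio/transcript data.
received_demo = pd.DataFrame({
    'person': ['u1', 'u2'],
    'offer_id': ['o1', 'o2'],
    'time': [0, 168],          # hours since the start of the test
})
portfolio_units_demo = pd.DataFrame({
    'id': ['o1', 'o2'],
    'duration': [7, 3],        # days the offer stays open
})

# Attach each offer's duration to the event that delivered it.
windows = received_demo.merge(
    portfolio_units_demo, left_on='offer_id', right_on='id', how='left'
)

# Convert the duration from days to hours so it shares a unit with `time`.
windows['expires_at'] = windows['time'] + windows['duration'] * 24
print(windows[['person', 'offer_id', 'time', 'expires_at']])
```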

In [4]:
import pandas as pd
import numpy as np
import math
import json
import matplotlib.pyplot as plt

%matplotlib inline

# read in the json files
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)

# USERS

profile DataFrame analysis conclusions:
* There are 17 000 users
* 14 825 users have complete information
* Users with an age of 118 are assumed not to have provided this information
* 2 175 users have no 'age', 'gender' or 'income' information
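
One possible way to act on these conclusions is to turn the placeholder age into a proper missing value and flag the affected users. This is only a sketch on toy data: `profile_demo`, `profile_clean` and the `missing_demographics` column are hypothetical names, and treating 118 as "not provided" is the assumption stated above, not something the data itself confirms.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `profile` (hypothetical rows with the same columns).
profile_demo = pd.DataFrame({
    'gender': [None, 'F', None, 'M'],
    'age': [118, 55, 118, 68],
    'id': ['u1', 'u2', 'u3', 'u4'],
    'became_member_on': [20170212, 20170715, 20180712, 20180426],
    'income': [np.nan, 112000.0, np.nan, 70000.0],
})

profile_clean = profile_demo.copy()

# Assumption: an age of 118 is a placeholder for "age not provided".
profile_clean['age'] = (
    profile_clean['age'].astype(float).where(profile_clean['age'] != 118)
)

# Flag users for whom all three demographic fields are missing.
demographics = ['age', 'gender', 'income']
profile_clean['missing_demographics'] = (
    profile_clean[demographics].isna().all(axis=1)
)

n_missing = int(profile_clean['missing_demographics'].sum())
n_complete = int(profile_clean[demographics].notna().all(axis=1).sum())
print('Users without demographics:', n_missing)
print('Users with complete information:', n_complete)
```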

In [2]:
profile.head()

| | gender | age | id | became_member_on | income |
|---|---|---|---|---|---|
| 0 | | 118 | 68be06ca386d4c31939f3a4f0e3dd783 | 20170212 | |
| 1 | F | 55 | 0610b486422d4921ae7d2bf64640c50b | 20170715 | 112000.0 |
| 2 | | 118 | 38fe809add3b4fcf9315a9694bb96ff5 | 20180712 | |
| 3 | F | 75 | 78afa995795e4d85b5d9ceeca43f5fef | 20170509 | 100000.0 |
| 4 | | 118 | a03223e636434f42ac4c3df47e8bac43 | 20170804 | |


In [None]:
print('Number of users:', profile.shape[0])
print('Number of users listed more than once:', profile.shape[0] - profile['id'].nunique())

In [None]:
user_age = profile.groupby('age')['id'].count()
print('Highest value in age feature:', user_age[-1:])
print ('Number of NaN values:' , profile["age"].isna().sum())
print ('Note: There are 2175 users with a value of 118 - we can assume that the user did not provide this information.')

plt.bar(user_age.index , user_age);

In [None]:
user_became_member = profile.groupby('became_member_on')['id'].count()
print('First user:', user_became_member[:1])
print('Last user:', user_became_member[-1:])
print('Number of NaN values:', profile["became_member_on"].isna().sum())
print('Note: consider converting this column from int to datetime.')
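
The conversion raised in the note above could look roughly like this. It is a sketch that assumes the integers follow a `YYYYMMDD` layout, which is what the sample rows suggest; `member_demo` and `member_year` are hypothetical names.

```python
import pandas as pd

# Toy stand-in using values of the same shape as `became_member_on`.
member_demo = pd.DataFrame({'became_member_on': [20170212, 20170715, 20180712]})

# Assumption: the integers encode dates as YYYYMMDD.
member_demo['became_member_on'] = pd.to_datetime(
    member_demo['became_member_on'].astype(str), format='%Y%m%d'
)

# With a datetime column, date parts become directly accessible.
member_demo['member_year'] = member_demo['became_member_on'].dt.year
print(member_demo)
```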

In [None]:
user_gender = profile.groupby('gender')['id'].count()
print(user_gender)
print ('Number of NaN values:' , profile["gender"].isna().sum())
print('Note: there are also 2175 users without "gender" information, the same number of users that have an age of 118. Are they the same users?')
print('Users with age 118 and no gender information:', profile[profile['age'] == 118]["gender"].isna().sum())
print('Yes, the users with age 118 are the same users that have no gender information, which is further reason to assume that they did not provide it.')

In [None]:
user_income = profile.groupby('income')['id'].count()
print('Income range: from', user_income.index.min(), 'to', user_income.index.max())
print('Number of NaN values:', profile["income"].isna().sum())
print('Again 2175 NaN values')
print('Users with age 118 and no income information:', profile[profile['age'] == 118]["income"].isna().sum())
user_income.plot();

# PROMOTIONS / OFFER

portfolio DataFrame analysis conclusions:
* There are 10 offers
* There are 3 types of offers: bogo, informational and discount
* There are no NaN values

In [None]:
portfolio

In [None]:
print('Number of offers:', portfolio.shape[0])
print('Number of offers listed more than once:', portfolio.shape[0] - portfolio['id'].nunique())

In [None]:
offer_channel_lst = portfolio.channels.tolist()
unique_offer_channel_lst = []
for x in offer_channel_lst:
    # keep each channel combination only the first time it appears
    if x not in unique_offer_channel_lst:
        unique_offer_channel_lst.append(x)
print(unique_offer_channel_lst)
print('There are 4 combinations of channels to interact with the offer; "email" is in all of them')
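
Because each entry in `channels` is a list, one way to make the column easier to analyse is to expand it into one indicator column per channel. The sketch below uses toy data: `portfolio_demo`, its channel values, and the `channel_*` column names are hypothetical.

```python
import pandas as pd

# Toy stand-in for `portfolio` (hypothetical offers and channel lists).
portfolio_demo = pd.DataFrame({
    'id': ['o1', 'o2', 'o3'],
    'channels': [
        ['email', 'mobile', 'social'],
        ['web', 'email'],
        ['web', 'email', 'mobile'],
    ],
})

# Distinct combinations, ignoring the order in which channels are listed.
unique_combinations = {tuple(sorted(c)) for c in portfolio_demo['channels']}

# Every channel that appears in at least one offer.
all_channels = sorted(set().union(*portfolio_demo['channels']))

# One 0/1 indicator column per channel.
for channel in all_channels:
    portfolio_demo[f'channel_{channel}'] = portfolio_demo['channels'].apply(
        lambda offered: int(channel in offered)
    )

print(unique_combinations)
print(portfolio_demo.drop(columns='channels'))
```

A channel whose indicator column is 1 for every row (such as "email" in this toy frame) is present in all offers.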

In [None]:
ordered_portfolio = portfolio.sort_values(by=['difficulty' ,'duration' , 'reward'])
ordered_portfolio

# ACTIONS / EVENT

transcript DataFrame analysis conclusions:

**event**

* offer completed     33579
* offer received      76277
* offer viewed        57725
* transaction        138953


* 75.68% of the offers received were viewed (viewed events / received events)
* The number of offers completed is 58.17% of the number of offers viewed

**users**
* All 17 000 users are in the dataframe
* 16 578 users have made at least one transaction

**transaction**
* Transaction events are not linked to an offer event, so the 'value' feature could be split into 'offer id' and 'amount'.

**offer completed**
* 'offer completed' events have a value with 'offer_id' and 'reward'. Note the difference in naming between 'offer_id' and 'offer id'.
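
A possible way to split the `value` dictionaries into separate columns, while reconciling the two spellings of the offer key, is sketched below. The frame `transcript_demo`, its rows, and the helper `extract_offer_id` are hypothetical and only mimic the structure described above.

```python
import pandas as pd

# Toy stand-in for `transcript` (hypothetical events).
transcript_demo = pd.DataFrame({
    'person': ['u1', 'u1', 'u2', 'u2'],
    'event': ['offer received', 'transaction', 'offer completed', 'transaction'],
    'time': [0, 6, 12, 18],
    'value': [
        {'offer id': 'o1'},
        {'amount': 9.5},
        {'offer_id': 'o1', 'reward': 5},
        {'amount': 20.0},
    ],
})


def extract_offer_id(value):
    """Return the offer id whether the key is 'offer id' or 'offer_id'."""
    return value.get('offer id', value.get('offer_id'))


# Missing keys yield None, which pandas treats as a missing value.
transcript_demo['offer_id'] = transcript_demo['value'].apply(extract_offer_id)
transcript_demo['amount'] = pd.to_numeric(
    transcript_demo['value'].apply(lambda v: v.get('amount')), errors='coerce'
)
transcript_demo['reward'] = pd.to_numeric(
    transcript_demo['value'].apply(lambda v: v.get('reward')), errors='coerce'
)

print(transcript_demo.drop(columns='value'))
```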

The transaction 'amount' could be added to each user's profile as a number of transactions and a total amount, to know how much each user spends in total and on average.
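
The per-user aggregation could be sketched as follows, assuming the amount has already been extracted into its own column. The frames `transactions_demo` and `profile_ids_demo` and the column names `n_transactions`, `total_amount` and `mean_amount` are hypothetical.

```python
import pandas as pd

# Toy stand-ins (hypothetical values): one row per transaction, one row per user.
transactions_demo = pd.DataFrame({
    'person': ['u1', 'u1', 'u2'],
    'amount': [9.5, 20.5, 4.0],
})
profile_ids_demo = pd.DataFrame({'id': ['u1', 'u2', 'u3']})

# Count, total and average spend per user.
spend = (
    transactions_demo.groupby('person')['amount']
    .agg(n_transactions='count', total_amount='sum', mean_amount='mean')
    .reset_index()
)

# A left join keeps users who never made a transaction.
profile_spend = profile_ids_demo.merge(
    spend, how='left', left_on='id', right_on='person'
).drop(columns='person')

# Users without transactions get 0 for count and total; the mean stays NaN.
count_total = ['n_transactions', 'total_amount']
profile_spend[count_total] = profile_spend[count_total].fillna(0)

print(profile_spend)
```

Leaving the average as NaN for users with no transactions avoids implying that they spend 0 per purchase.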

In [None]:
transcript.head()

In [None]:
print('Number of events:' , transcript.shape[0])
print('Number of users:' , len(transcript.groupby('person')['event'].count()))

In [None]:
event = transcript.groupby('event')['person'].count()
event

In [None]:
print('Percentage offer viewed/received:', (event.loc['offer viewed'] * 100) / event.loc['offer received'], '%')
print('Percentage offer completed/received:', (event.loc['offer completed'] * 100) / event.loc['offer received'], '%')
print('Percentage offer completed/viewed:', (event.loc['offer completed'] * 100) / event.loc['offer viewed'], '%')

In [None]:
transaction = transcript[transcript['event'] == 'transaction']
transaction.head()

In [None]:
print('Number of users that made at least one transaction:' , len(transaction.groupby('person')['event'].count()))

In [None]:
transcript[transcript['event'] == 'offer received'].head()

In [None]:
transcript[transcript['event'] == 'offer viewed'].head()

In [None]:
transcript[transcript['event'] == 'offer completed'].head()