# Profiles and profile programs table analysis

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Profiles" data-toc-modified-id="Profiles-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Profiles</a></span><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Introduction</a></span></li></ul></li><li><span><a href="#Program-profiles" data-toc-modified-id="Program-profiles-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Program profiles</a></span><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Introduction</a></span></li></ul></li></ul></div>

## Profiles

### Introduction

This document presents a statistical analysis of the *Profiles* table. 
Once the derived *body_type* column is added, the table has 11 variables (columns) and 36 records. 
The analysis uses the following libraries:

* pandas
* numpy
* statistics
* matplotlib.pyplot
* seaborn
* pingouin
* distfit
* math
* statsmodels
* scipy.stats

The structure of the dataset and its first rows are shown below. 

In [51]:
import pandas as pd
import numpy as np
import statistics
import matplotlib.pyplot as plt
import seaborn as sns
import math
import pingouin as pg
import statsmodels.api as sm
import matplotlib as mpl
from scipy import stats
from distfit import distfit

pd.options.display.float_format = '{:.2f}'.format

import warnings
warnings.filterwarnings('ignore')

plt.rcParams["image.cmap"] = "Pastel2"

In [35]:
profiles = pd.read_csv('/home/evida-monika/mhunters/profiles.csv')

In [36]:
# Derive body_type from gender and min_fat_level:
# the 'Gordo' range starts at 30 for females (gender 0) and at 25 for males (gender 1).
body_type = []

for gender, min_fat in zip(profiles['gender'], profiles['min_fat_level']):
    if (gender == 0 and min_fat == 30) or (gender == 1 and min_fat == 25):
        body_type.append('Gordo')
    elif gender in (0, 1) and min_fat == 0:
        body_type.append('Delgado')
    else:
        # Keep the list aligned with the rows if an unexpected combination appears
        body_type.append(None)

profiles['body_type'] = body_type

# Treat the coded variables as categorical
cols = ['gender', 'activity_level', 'goal', 'max_fat_level', 'min_fat_level', 'body_type']

for col in cols:
    profiles[col] = profiles[col].astype('category')


In [37]:
profiles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36 entries, 0 to 35
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   id              36 non-null     int64   
 1   gender          36 non-null     category
 2   activity_level  36 non-null     category
 3   goal            36 non-null     category
 4   fat_level       0 non-null      float64 
 5   name            36 non-null     object  
 6   created_at      36 non-null     object  
 7   updated_at      36 non-null     object  
 8   max_fat_level   36 non-null     category
 9   min_fat_level   36 non-null     category
 10  body_type       36 non-null     category
dtypes: category(6), float64(1), int64(1), object(3)
memory usage: 2.5+ KB


In [38]:
profiles.head()

| | id | gender | activity_level | goal | fat_level | name | created_at | updated_at | max_fat_level | min_fat_level | body_type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19 | 0 | 0 | 1 | NaN | 35 - Female \| Very Active \| Gain \| Delgada | 2020-11-27 13:19:42.082901 | 2021-09-15 11:11:31.960274 | 29.99 | 0 | Delgado |
| 1 | 20 | 0 | 0 | 1 | NaN | 36 - Female \| Very Active \| Gain \| Gorda | 2020-11-27 13:20:03.833177 | 2021-09-15 11:11:40.131357 | 100.0 | 30 | Gordo |
| 2 | 15 | 0 | 0 | 0 | NaN | 31 - Female \| Very Active \| Lose \| Delgada | 2020-11-27 13:17:57.753404 | 2021-09-15 11:11:49.379382 | 29.99 | 0 | Delgado |
| 3 | 16 | 0 | 0 | 0 | NaN | 32 - Female \| Very Active \| Lose \| Gorda | 2020-11-27 13:18:14.413154 | 2021-09-15 11:11:58.101087 | 100.0 | 30 | Gordo |
| 4 | 28 | 1 | 1 | 1 | NaN | 12 - Male \| Active \| Gain \| Gordo | 2020-11-27 13:24:12.069931 | 2021-09-15 10:47:46.336885 | 100.0 | 25 | Gordo |


The following columns will be dropped because they are not needed for the analysis:

* *fat_level* (all values are NULL),
* *name* (repeats information held in other columns),
* *created_at*, *updated_at* (timestamps of when each profile was created and updated).
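
The reasons for dropping these columns can be checked directly. The sketch below is illustrative only: it rebuilds two rows from the `head()` output above as a small frame (`sample`, `all_null_cols` and `parts` are hypothetical names, not part of the notebook), lists the columns with no non-null values, and splits *name* back into the attributes it repeats.

```python
import numpy as np
import pandas as pd

# Two rows copied from the head() output above; `sample` is an illustrative name.
sample = pd.DataFrame({
    'id': [19, 20],
    'gender': [0, 0],
    'activity_level': [0, 0],
    'goal': [1, 1],
    'fat_level': [np.nan, np.nan],
    'name': ['35 - Female | Very Active | Gain | Delgada',
             '36 - Female | Very Active | Gain | Gorda'],
})

# A column counts as "all NULL" when it has zero non-null values.
all_null_cols = [col for col in sample.columns if sample[col].notna().sum() == 0]
print(all_null_cols)

# The name field packs gender, activity level, goal and body type into one string:
# drop the leading number, then split the rest on the ' | ' separator.
labels = sample['name'].str.split(' - ', n=1).str[1]
parts = labels.str.split(r' \| ', expand=True)
parts.columns = ['gender', 'activity_level', 'goal', 'body_type']
print(parts)
```

Running the same two checks on the full `profiles` frame would show whether the pattern holds for all 36 rows.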

The variables taken into consideration are:

* *gender* (0 - female, 1 - male),
* *activity_level* (0 - very active, 1 - active, 2 - sedentary),
* *goal* (0 - lose, 1 - gain, 2 - antiaging),
* *body_type* (Delgado or Gordo, a variable derived from the min/max fat levels), 
* *max_fat_level*,
* *min_fat_level*.
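
The body-type rule can also be written without an explicit loop. The following is a minimal sketch, assuming the thresholds seen in the data (a minimum fat level of 0 for Delgado, 30 for female Gordo profiles and 25 for male ones) are the only values that occur; `label_body_type` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def label_body_type(gender: pd.Series, min_fat_level: pd.Series) -> pd.Series:
    """Label each profile 'Delgado' or 'Gordo'; unexpected combinations stay missing."""
    # The Gordo range starts at 30 for females (0) and at 25 for males (1).
    gordo_threshold = gender.map({0: 30, 1: 25})
    known_gender = gender.isin([0, 1])

    labels = pd.Series(pd.NA, index=gender.index, dtype='object')
    labels[known_gender & (min_fat_level == 0)] = 'Delgado'
    labels[min_fat_level == gordo_threshold] = 'Gordo'
    return labels

# Values copied from the five rows of the head() output.
gender = pd.Series([0, 0, 0, 0, 1])
min_fat_level = pd.Series([0, 30, 0, 30, 25])

result = label_body_type(gender, min_fat_level)
print(result.tolist())
```

Rows that match neither rule are left missing rather than silently skipped, so the result always has one label per profile.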

Each row of the table describes one profile that a user can have. A summary of the selected data follows.

In [42]:
profiles2 = profiles.drop(['fat_level', 'name', 'created_at', 'updated_at'], axis = 1)

In [43]:
cat_names = {
    'gender': {1: 'male', 0: 'female'},
    'activity_level': {0: 'very active', 1: 'active', 2: 'sedentary'},
    'goal': {0: 'lose', 1: 'gain', 2: 'antiaging'}
}

profiles2 = profiles2.replace(cat_names)

cols = ['gender', 'activity_level', 'goal', 'max_fat_level', 'min_fat_level', 'body_type']

for col in cols:
    profiles2[col] = profiles2[col].astype('category')

In [44]:
profiles2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36 entries, 0 to 35
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   id              36 non-null     int64   
 1   gender          36 non-null     category
 2   activity_level  36 non-null     category
 3   goal            36 non-null     category
 4   max_fat_level   36 non-null     category
 5   min_fat_level   36 non-null     category
 6   body_type       36 non-null     category
dtypes: category(6), int64(1)
memory usage: 1.4 KB


Because the table lists the available profiles, every variable is split evenly across its 2 or 3 levels; for example, gender has 18 male and 18 female profiles. Frequency tables would therefore add nothing to the analysis.
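
An even split is what one would expect if the table held one profile per combination of the four attributes, since 2 × 3 × 3 × 2 = 36. That is an inference from the counts, not something the table states. The sketch below builds such a full grid with `itertools.product` (the `levels` dictionary and `grid` frame are illustrative, not the real data) and shows that every level of a variable then occurs equally often.

```python
from itertools import product

import pandas as pd

# Levels taken from the variable descriptions above.
levels = {
    'gender': ['female', 'male'],
    'activity_level': ['very active', 'active', 'sedentary'],
    'goal': ['lose', 'gain', 'antiaging'],
    'body_type': ['Delgado', 'Gordo'],
}

# One row per combination of levels: 2 * 3 * 3 * 2 = 36 rows.
grid = pd.DataFrame(list(product(*levels.values())), columns=list(levels))

# In a full grid every level of a variable appears the same number of times.
for col in grid.columns:
    print(col, grid[col].value_counts().to_dict())
```

Running the same loop over `profiles2` would confirm whether the real table is balanced in this way.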

## Program profiles

### Introduction

The *program_profiles* table has 5 variables and 628 rows. The variables are:

* *id*,
* *program_id* - ID number for the program,
* *profile_id* - ID number for the profile,
* *created_at*, *updated_at*.

A glimpse of the data and a summary of its structure are shown below.

In [50]:
program_profiles = pd.read_csv('/home/evida-monika/mhunters/program_profiles.csv', on_bad_lines='skip', low_memory=False)

program_profiles.head()

| | id | program_id | profile_id | created_at | updated_at |
|---|---|---|---|---|---|
| 0 | 1 | 34 | 1 | 2020-11-23 16:10:14.679167 | 2020-11-23 16:10:14.679167 |
| 1 | 2 | 28 | 20 | 2020-11-27 13:41:47.642457 | 2020-11-27 13:41:47.642457 |
| 2 | 3 | 28 | 19 | 2020-11-27 13:41:47.648063 | 2020-11-27 13:41:47.648063 |
| 3 | 4 | 27 | 17 | 2020-11-27 13:43:07.059832 | 2020-11-27 13:43:07.059832 |
| 4 | 5 | 27 | 15 | 2020-11-27 13:43:07.065767 | 2020-11-27 13:43:07.065767 |


In [52]:
program_profiles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 628 entries, 0 to 627
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          628 non-null    int64 
 1   program_id  628 non-null    int64 
 2   profile_id  628 non-null    int64 
 3   created_at  628 non-null    object
 4   updated_at  628 non-null    object
dtypes: int64(3), object(2)
memory usage: 24.7+ KB


The *program_profiles* table links the *programs* and *profiles* tables, so the three can be joined, as is done below. 
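
A join of this kind can be sketched on a few rows first. The example below uses three link rows copied from the `program_profiles` output above and a cut-down profiles frame; the toy frames (`links`, `profiles_small`) and the use of `validate` and `indicator` are illustrative additions, not part of the original notebook.

```python
import pandas as pd

# Link rows copied from the program_profiles head() output.
links = pd.DataFrame({
    'id': [2, 3, 5],
    'program_id': [28, 28, 27],
    'profile_id': [20, 19, 15],
})

# Cut-down profiles table with ids and names from the profiles head() output.
profiles_small = pd.DataFrame({
    'id': [19, 20, 15],
    'name': ['35 - Female | Very Active | Gain | Delgada',
             '36 - Female | Very Active | Gain | Gorda',
             '31 - Female | Very Active | Lose | Delgada'],
})

joined = links.merge(
    profiles_small,
    how='left',
    left_on='profile_id',
    right_on='id',
    suffixes=('_link', '_profile'),  # disambiguate the two id columns
    validate='many_to_one',          # raise if a profile id is duplicated on the right
    indicator=True,                  # add a _merge column: 'both' or 'left_only'
)
print(joined[['id_link', 'program_id', 'profile_id', 'name', '_merge']])
```

With `indicator=True`, any link row whose `profile_id` has no match in the profiles table shows up as `left_only`, which makes dangling references easy to spot after a left join.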

In [53]:
programs = pd.read_csv('/home/evida-monika/mhunters/programs.csv', on_bad_lines='skip', low_memory=False)
programs2 = programs.drop(['user_id', 'code_name', 'name_es', 
                           'description_es', 'auto_generated', 'priority_order', 
                           'next_program_id'], axis = 1)


# Recode the 't'/'f' flags as 'True'/'False' labels
programs2 = programs2.replace({'t': 'True', 'f': 'False'})

# Store the two flag columns as categories
for col in ['pro', 'available']:
    programs2[col] = programs2[col].astype('category')

# Parse the timestamp columns
col_date = ['created_at', 'updated_at']

for col in col_date:
    programs2[col] = pd.to_datetime(programs2[col])


In [58]:
pr_pr1 = program_profiles.merge(programs2, how = 'left', left_on = 'program_id', right_on = 'id')

pr_pr1.rename(columns = {'id_x': 'id_program_profiles',
                         'created_at_x': 'created_at_program_profiles',
                         'updated_at_x': 'updated_at_program_profiles',
                         'id_y': 'id_programs',
                         'created_at_y': 'created_at_programs',
                         'updated_at_y': 'updated_at_programs'},
             inplace = True)

In [60]:
pr_pr = pr_pr1.merge(profiles, how = 'left', left_on = 'profile_id', right_on = 'id')

# rename() returns a new frame, so the result has to be assigned back
pr_pr = pr_pr.rename(columns = {'id': 'id_profile',
                                'created_at': 'created_at_profiles',
                                'updated_at': 'updated_at_profiles'})

pr_pr.head()

| | id_program_profiles | program_id | profile_id | created_at_program_profiles | updated_at_program_profiles | id_programs | created_at_programs | updated_at_programs | pro | available | ... | gender | activity_level | goal | fat_level | name | created_at_profiles | updated_at_profiles | max_fat_level | min_fat_level | body_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 34 | 1 | 2020-11-23 16:10:14.679167 | 2020-11-23 16:10:14.679167 | 34 | 2020-11-23 14:15:43.775009 | 2021-09-29 14:54:09.125824 | True | True | ... | 1 | 2 | 0 | NaN | 1 - Male \| Sedentary \| Lose \| Delgado | 2020-11-23 16:10:01.566607 | 2021-09-15 10:27:35.753055 | 24.99 | 0 | Delgado |
| 1 | 2 | 28 | 20 | 2020-11-27 13:41:47.642457 | 2020-11-27 13:41:47.642457 | 28 | 2020-11-23 14:05:56.089703 | 2021-09-29 14:53:42.661195 | False | True | ... | 0 | 0 | 1 | NaN | 36 - Female \| Very Active \| Gain \| Gorda | 2020-11-27 13:20:03.833177 | 2021-09-15 11:11:40.131357 | 100.0 | 30 | Gordo |
| 2 | 3 | 28 | 19 | 2020-11-27 13:41:47.648063 | 2020-11-27 13:41:47.648063 | 28 | 2020-11-23 14:05:56.089703 | 2021-09-29 14:53:42.661195 | False | True | ... | 0 | 0 | 1 | NaN | 35 - Female \| Very Active \| Gain \| Delgada | 2020-11-27 13:19:42.082901 | 2021-09-15 11:11:31.960274 | 29.99 | 0 | Delgado |
| 3 | 4 | 27 | 17 | 2020-11-27 13:43:07.059832 | 2020-11-27 13:43:07.059832 | 27 | 2020-11-23 14:05:30.101469 | 2021-09-29 14:53:35.625344 | False | True | ... | 0 | 0 | 2 | NaN | 33 - Female \| Very Active \| Antiaging \| Delgada | 2020-11-27 13:18:58.145329 | 2021-09-15 11:11:15.326568 | 29.99 | 0 | Delgado |
| 4 | 5 | 27 | 15 | 2020-11-27 13:43:07.065767 | 2020-11-27 13:43:07.065767 | 27 | 2020-11-23 14:05:30.101469 | 2021-09-29 14:53:35.625344 | False | True | ... | 0 | 0 | 0 | NaN | 31 - Female \| Very Active \| Lose \| Delgada | 2020-11-27 13:17:57.753404 | 2021-09-15 11:11:49.379382 | 29.99 | 0 | Delgado |
