In [1]:
import pandas as pd
import numpy as np
import acquire as a
from env import get_db_url

import prepare as p

### Day 1 Wrangling the dataframe

---

In [2]:
# acquire data
df = a.get_log_data()

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 900223 entries, 0 to 900222
Data columns (total 16 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   Unnamed: 0  900223 non-null  int64  
 1   date        900223 non-null  object 
 2   time        900223 non-null  object 
 3   path        900222 non-null  object 
 4   user_id     900223 non-null  int64  
 5   cohort_id   847330 non-null  float64
 6   ip          900223 non-null  object 
 7   id          847330 non-null  float64
 8   name        847330 non-null  object 
 9   slack       847330 non-null  object 
 10  start_date  847330 non-null  object 
 11  end_date    847330 non-null  object 
 12  created_at  847330 non-null  object 
 13  updated_at  847330 non-null  object 
 14  deleted_at  0 non-null       float64
 15  program_id  847330 non-null  float64
dtypes: float64(4), int64(2), object(10)
memory usage: 109.9+ MB


In [4]:
df.head()

Unnamed: 0.1,Unnamed: 0,date,time,path,user_id,cohort_id,ip,id,name,slack,start_date,end_date,created_at,updated_at,deleted_at,program_id
0,0,2018-01-26,09:55:03,/,1,8.0,97.105.19.61,8.0,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,,1.0
1,1,2018-01-26,09:56:02,java-ii,1,8.0,97.105.19.61,8.0,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,,1.0
2,2,2018-01-26,09:56:05,java-ii/object-oriented-programming,1,8.0,97.105.19.61,8.0,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,,1.0
3,3,2018-01-26,09:56:06,slides/object_oriented_programming,1,8.0,97.105.19.61,8.0,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,,1.0
4,4,2018-01-26,09:56:24,javascript-i/conditionals,2,22.0,97.105.19.61,22.0,Teddy,#teddy,2018-01-08,2018-05-17,2018-01-08 13:59:10,2018-01-08 13:59:10,,2.0


In [5]:
# use the date column as the index (it is still string-typed at this point)
df.set_index('date', inplace=True)
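
Since `df.info()` reports `date` as `object`, the index set above holds strings rather than timestamps. A minimal sketch of parsing the column first, using a toy frame (`demo` is a hypothetical name, not part of the notebook):

```python
import pandas as pd

# toy stand-in for the log frame; only the columns needed for the illustration
demo = pd.DataFrame({"date": ["2018-01-26", "2018-01-27"], "path": ["/", "java-ii"]})

# parse the strings into timestamps before using them as the index
demo["date"] = pd.to_datetime(demo["date"])
demo = demo.set_index("date")
```

With a `DatetimeIndex` in place, time-based operations such as resampling become available.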

In [6]:
# drop the leftover index column and the all-null deleted_at column
df = df.drop(columns=['Unnamed: 0', 'deleted_at'])

In [7]:
df.head()

date,time,path,user_id,cohort_id,ip,id,name,slack,start_date,end_date,created_at,updated_at,program_id
2018-01-26,09:55:03,/,1,8.0,97.105.19.61,8.0,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,1.0
2018-01-26,09:56:02,java-ii,1,8.0,97.105.19.61,8.0,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,1.0
2018-01-26,09:56:05,java-ii/object-oriented-programming,1,8.0,97.105.19.61,8.0,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,1.0
2018-01-26,09:56:06,slides/object_oriented_programming,1,8.0,97.105.19.61,8.0,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,1.0
2018-01-26,09:56:24,javascript-i/conditionals,2,22.0,97.105.19.61,22.0,Teddy,#teddy,2018-01-08,2018-05-17,2018-01-08 13:59:10,2018-01-08 13:59:10,2.0


### Moving forward, we will prepare the dataframe so that it can answer all of the questions below.

### The prepared dataframe:

- Drops the columns:
    - "slack"
    - "id"
    - "deleted_at"
    - "Unnamed: 0"

- Converts to datetime:
    - start_date
    - end_date
    - created_at
    - updated_at

- Combines the "date" and "time" columns into a single date_time column, then drops the columns that are no longer needed.
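
The preparation steps listed above could be sketched roughly as follows. This is not the actual `prepare.py` code, which is not shown here; `prepare_logs_sketch` is a hypothetical name, and the sketch does not reproduce the relabeling of `name` and `program_id` visible in the prepared output.

```python
import pandas as pd

def prepare_logs_sketch(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the prepare step described above."""
    df = raw.copy()
    # combine the separate date and time strings into one datetime column
    df["date_time"] = pd.to_datetime(df["date"] + " " + df["time"])
    # parse the cohort metadata timestamps
    for col in ["start_date", "end_date", "created_at", "updated_at"]:
        df[col] = pd.to_datetime(df[col])
    # drop redundant or empty columns; errors="ignore" skips any that are absent
    to_drop = ["Unnamed: 0", "slack", "id", "deleted_at", "date", "time"]
    return df.drop(columns=to_drop, errors="ignore")

# one-row example built from the first record shown in df.head()
raw = pd.DataFrame({
    "Unnamed: 0": [0],
    "date": ["2018-01-26"],
    "time": ["09:55:03"],
    "path": ["/"],
    "user_id": [1],
    "cohort_id": [8.0],
    "ip": ["97.105.19.61"],
    "id": [8.0],
    "name": ["Hampton"],
    "slack": ["#hampton"],
    "start_date": ["2015-09-22"],
    "end_date": ["2016-02-06"],
    "created_at": ["2016-06-14 19:52:26"],
    "updated_at": ["2016-06-14 19:52:26"],
    "deleted_at": [float("nan")],
    "program_id": [1.0],
})
prepared = prepare_logs_sketch(raw)
```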


---

# Exploring the Data

### Pulling the initial prepared dataframe

In [8]:
# both functions are found in prepare.py; df ends up holding the result of get_q6_eda_df
df = p.prepare_logs()
df = p.get_q6_eda_df()

In [9]:
# sanity check
df.head()

Unnamed: 0,path,user_id,cohort_id,ip,name,start_date,end_date,created_at,updated_at,program_id,date_time
0,/,1,8.0,97.105.19.61,hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,full_stack_php,2018-01-26 09:55:03
1,java-ii,1,8.0,97.105.19.61,hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,full_stack_php,2018-01-26 09:56:02
2,java-ii/object-oriented-programming,1,8.0,97.105.19.61,hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,full_stack_php,2018-01-26 09:56:05
3,slides/object_oriented_programming,1,8.0,97.105.19.61,hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,full_stack_php,2018-01-26 09:56:06
4,javascript-i/conditionals,2,22.0,97.105.19.61,teddy,2018-01-08,2018-05-17,2018-01-08 13:59:10,2018-01-08 13:59:10,full_stack_java,2018-01-26 09:56:24


---

_"Email to analyst:_


_Hello,_


_I have some questions for you that I need to be answered before the board meeting Thursday afternoon. I need to be able to speak to the following questions. I also need a single slide that I can incorporate into my existing presentation (Google Slides) that summarizes the most important points. My questions are listed below; however, if you discover anything else important that I didn’t think to ask, please include that as well."_

### Which lesson appears to attract the most traffic consistently across cohorts (per program)?

First, I will drop rows with nulls so that every remaining row has both a cohort and a path (lesson).

In [10]:
df = df.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 847329 entries, 0 to 900222
Data columns (total 11 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   path        847329 non-null  object        
 1   user_id     847329 non-null  int64         
 2   cohort_id   847329 non-null  float64       
 3   ip          847329 non-null  object        
 4   name        847329 non-null  object        
 5   start_date  847329 non-null  datetime64[ns]
 6   end_date    847329 non-null  datetime64[ns]
 7   created_at  847329 non-null  datetime64[ns]
 8   updated_at  847329 non-null  datetime64[ns]
 9   program_id  847329 non-null  object        
 10  date_time   847329 non-null  datetime64[ns]
dtypes: datetime64[ns](5), float64(1), int64(1), object(4)
memory usage: 77.6+ MB
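
Before dropping, it can help to see which columns actually hold the nulls and to restrict the drop to the columns that matter for the question. A small sketch on toy data (`logs` and its values are illustrative, not taken from the real logs):

```python
import numpy as np
import pandas as pd

# toy frame with one missing path and one missing cohort
logs = pd.DataFrame({
    "path": ["/", None, "java-ii", "toc"],
    "cohort_id": [8.0, 8.0, np.nan, 22.0],
    "user_id": [1, 1, 2, 3],
})

# count of nulls in each column
nulls_per_column = logs.isna().sum()

# drop only the rows that lack a path or a cohort
clean = logs.dropna(subset=["path", "cohort_id"])
```

Using `subset` keeps rows that are only missing values in columns the analysis does not use.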


##### Assumption: any path listed in this dataframe has received traffic, since every remaining row has a cohort start and end date.

How can I answer this question?
- Group by Cohort
- Identify each program
- Get a count of the program per cohort
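
The plan above can be sketched in one pass with a grouped count. The frame below is toy data, and `hits` and `top_per_cohort` are names chosen for illustration; the column names match the prepared dataframe.

```python
import pandas as pd

logs = pd.DataFrame({
    "program_id": ["full_stack_java"] * 6 + ["data_science"] * 3,
    "name": ["teddy"] * 3 + ["oberon"] * 3 + ["bayes"] * 3,
    "path": ["javascript-i", "javascript-i", "java-ii",
             "javascript-i", "javascript-i", "spring",
             "sql/mysql-overview", "sql/mysql-overview", "classification/overview"],
})

# hits per lesson within each cohort of each program
hits = (
    logs.groupby(["program_id", "name", "path"])
        .size()
        .rename("hits")
        .reset_index()
)

# the single most-visited lesson for each cohort
top_per_cohort = (
    hits.sort_values("hits", ascending=False)
        .drop_duplicates(subset=["program_id", "name"])
        .sort_values(["program_id", "name"])
        .reset_index(drop=True)
)
```

Ties within a cohort would be resolved arbitrarily by this approach, so a real analysis may want an explicit tie-break.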

### First, let's create dataframes for each program.

In [11]:
df.program_id.value_counts()

full_stack_java          713365
data_science             103411
full_stack_php            30548
front_end_programming         5
Name: program_id, dtype: int64

- full_stack_java
- data_science
- full_stack_php
- front_end_programming
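
As an alternative to writing one filter per program, the split could be done in a single step. A minimal sketch on toy data (`by_program` is a hypothetical name):

```python
import pandas as pd

logs = pd.DataFrame({
    "program_id": ["full_stack_java", "data_science", "full_stack_java", "full_stack_php"],
    "path": ["javascript-i", "classification/overview", "spring", "index.html"],
})

# one frame per program, keyed by the program name
by_program = {program: frame for program, frame in logs.groupby("program_id")}
```

Each value is an ordinary dataframe, so `by_program["full_stack_java"]` plays the same role as `fsj` in the cells that follow.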

#### Full Stack Java == fsj

In [12]:
# full stack java dataframe
fsj  = df[df.program_id == 'full_stack_java']
fsj.head()

Unnamed: 0,path,user_id,cohort_id,ip,name,start_date,end_date,created_at,updated_at,program_id,date_time
4,javascript-i/conditionals,2,22.0,97.105.19.61,teddy,2018-01-08,2018-05-17,2018-01-08 13:59:10,2018-01-08 13:59:10,full_stack_java,2018-01-26 09:56:24
5,javascript-i/loops,2,22.0,97.105.19.61,teddy,2018-01-08,2018-05-17,2018-01-08 13:59:10,2018-01-08 13:59:10,full_stack_java,2018-01-26 09:56:41
6,javascript-i/conditionals,3,22.0,97.105.19.61,teddy,2018-01-08,2018-05-17,2018-01-08 13:59:10,2018-01-08 13:59:10,full_stack_java,2018-01-26 09:56:46
7,javascript-i/functions,3,22.0,97.105.19.61,teddy,2018-01-08,2018-05-17,2018-01-08 13:59:10,2018-01-08 13:59:10,full_stack_java,2018-01-26 09:56:48
8,javascript-i/loops,2,22.0,97.105.19.61,teddy,2018-01-08,2018-05-17,2018-01-08 13:59:10,2018-01-08 13:59:10,full_stack_java,2018-01-26 09:56:59


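In [34]:
# split the logs by Java cohort; store each slice in a dict so that fsj is not overwritten
cohorts = fsj.name.unique().tolist()
fsj_by_cohort = {c: df[df.name == c] for c in cohorts}

In [35]:
# inspect the last cohort in the list
fsj_by_cohort[cohorts[-1]]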

Unnamed: 0,path,user_id,cohort_id,ip,name,start_date,end_date,created_at,updated_at,program_id,date_time
886729,/,954,139.0,72.177.209.77,oberon,2021-04-12,2021-10-01,2021-04-12 18:07:21,2021-04-12 18:07:21,full_stack_java,2021-04-12 16:44:17
886730,/,955,139.0,70.121.220.245,oberon,2021-04-12,2021-10-01,2021-04-12 18:07:21,2021-04-12 18:07:21,full_stack_java,2021-04-12 16:44:17
886731,/,956,139.0,162.200.114.251,oberon,2021-04-12,2021-10-01,2021-04-12 18:07:21,2021-04-12 18:07:21,full_stack_java,2021-04-12 16:44:18
886732,/,957,139.0,76.185.197.205,oberon,2021-04-12,2021-10-01,2021-04-12 18:07:21,2021-04-12 18:07:21,full_stack_java,2021-04-12 16:44:18
886733,/,958,139.0,69.239.143.192,oberon,2021-04-12,2021-10-01,2021-04-12 18:07:21,2021-04-12 18:07:21,full_stack_java,2021-04-12 16:44:18
...,...,...,...,...,...,...,...,...,...,...,...
899851,javascript-i/testing-user-functions,969,139.0,107.77.169.64,oberon,2021-04-12,2021-10-01,2021-04-12 18:07:21,2021-04-12 18:07:21,full_stack_java,2021-04-21 12:13:18
899865,javascript-i,969,139.0,107.77.169.64,oberon,2021-04-12,2021-10-01,2021-04-12 18:07:21,2021-04-12 18:07:21,full_stack_java,2021-04-21 12:31:48
899866,javascript-i/javascript-with-html,969,139.0,107.77.169.64,oberon,2021-04-12,2021-10-01,2021-04-12 18:07:21,2021-04-12 18:07:21,full_stack_java,2021-04-21 12:31:58
899867,javascript-i/testing-user-functions,969,139.0,107.77.169.64,oberon,2021-04-12,2021-10-01,2021-04-12 18:07:21,2021-04-12 18:07:21,full_stack_java,2021-04-21 12:32:01


#### data science == ds

In [13]:
ds = df[df.program_id == 'data_science']
ds.head()

Unnamed: 0,path,user_id,cohort_id,ip,name,start_date,end_date,created_at,updated_at,program_id,date_time
326053,/,466,34.0,97.105.19.58,bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,2019-08-20 14:38:55,data_science,2019-08-20 09:39:58
326054,/,467,34.0,97.105.19.58,bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,2019-08-20 14:38:55,data_science,2019-08-20 09:39:59
326055,/,468,34.0,97.105.19.58,bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,2019-08-20 14:38:55,data_science,2019-08-20 09:39:59
326056,/,469,34.0,97.105.19.58,bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,2019-08-20 14:38:55,data_science,2019-08-20 09:40:02
326057,/,470,34.0,97.105.19.58,bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,2019-08-20 14:38:55,data_science,2019-08-20 09:40:08


In [23]:
ds.name.value_counts()

darden      32015
bayes       26538
curie       21581
easley      14715
florence     8562
Name: name, dtype: int64

#### full stack php = fsp

In [14]:
fsp = df[df.program_id == 'full_stack_php']
fsp.head()

Unnamed: 0,path,user_id,cohort_id,ip,name,start_date,end_date,created_at,updated_at,program_id,date_time
0,/,1,8.0,97.105.19.61,hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,full_stack_php,2018-01-26 09:55:03
1,java-ii,1,8.0,97.105.19.61,hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,full_stack_php,2018-01-26 09:56:02
2,java-ii/object-oriented-programming,1,8.0,97.105.19.61,hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,full_stack_php,2018-01-26 09:56:05
3,slides/object_oriented_programming,1,8.0,97.105.19.61,hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,full_stack_php,2018-01-26 09:56:06
30,/,11,1.0,97.105.19.61,arches,2014-02-04,2014-04-22,2016-06-14 19:52:26,2016-06-14 19:52:26,full_stack_php,2018-01-26 10:14:47


#### front end programming = fep

In [15]:
fep = df[df.program_id == 'front_end_programming']
fep

Unnamed: 0,path,user_id,cohort_id,ip,name,start_date,end_date,created_at,updated_at,program_id,date_time
31627,/,152,9.0,207.68.209.17,apollo,2015-03-30,2015-07-29,2016-06-14 19:52:26,2016-06-14 19:52:26,front_end_programming,2018-03-22 19:01:49
31628,content/html-css,152,9.0,207.68.209.17,apollo,2015-03-30,2015-07-29,2016-06-14 19:52:26,2016-06-14 19:52:26,front_end_programming,2018-03-22 19:01:54
31629,content/html-css/gitbook/images/favicon.ico,152,9.0,207.68.209.17,apollo,2015-03-30,2015-07-29,2016-06-14 19:52:26,2016-06-14 19:52:26,front_end_programming,2018-03-22 19:01:54
31630,content/html-css,152,9.0,207.68.209.17,apollo,2015-03-30,2015-07-29,2016-06-14 19:52:26,2016-06-14 19:52:26,front_end_programming,2018-03-22 19:02:47
31631,content/html-css/introduction.html,152,9.0,207.68.209.17,apollo,2015-03-30,2015-07-29,2016-06-14 19:52:26,2016-06-14 19:52:26,front_end_programming,2018-03-22 19:02:52


### Now I'll move into identifying the path value counts, and answering the question. 

#### Full Stack Java

In [16]:
# keep only the paths with more than 10,000 hits (filter already returns a dataframe)
fsj_test = fsj.groupby('path').filter(lambda x: len(x) > 10000)
fsj_test.path.value_counts()

/                           35814
javascript-i                17457
toc                         17428
search/search_index.json    15212
java-iii                    12683
html-css                    12569
java-ii                     11719
spring                      11376
jquery                      10693
mysql                       10318
java-i                      10016
Name: path, dtype: int64
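
The same thresholded counts can be obtained without the group-wise filter by counting once and masking the result. A sketch on toy data, with a hypothetical cutoff of 2 standing in for 10,000:

```python
import pandas as pd

fsj_demo = pd.DataFrame({"path": ["/"] * 4 + ["javascript-i"] * 3 + ["toc"] * 1})
threshold = 2  # hypothetical cutoff for the toy data

# groupby-filter approach, as in the cell above
filtered = fsj_demo.groupby("path").filter(lambda g: len(g) > threshold)
via_filter = filtered["path"].value_counts()

# direct approach: count once, then keep the counts above the cutoff
counts = fsj_demo["path"].value_counts()
via_mask = counts[counts > threshold]
```

Both routes give the same counts; the direct one avoids calling a Python lambda for every path.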

#### data science

In [17]:
# keep only the paths with more than 600 hits
ds_test = ds.groupby('path').filter(lambda x: len(x) > 600)
ds_test.path.value_counts()

/                                                    8358
search/search_index.json                             2203
classification/overview                              1785
1-fundamentals/modern-data-scientist.jpg             1655
1-fundamentals/AI-ML-DL-timeline.jpg                 1651
1-fundamentals/1.1-intro-to-data-science             1633
classification/scale_features_or_not.svg             1590
fundamentals/AI-ML-DL-timeline.jpg                   1443
fundamentals/modern-data-scientist.jpg               1438
sql/mysql-overview                                   1424
fundamentals/intro-to-data-science                   1413
6-regression/1-overview                              1124
anomaly-detection/AnomalyDetectionCartoon.jpeg        829
anomaly-detection/overview                            804
10-anomaly-detection/AnomalyDetectionCartoon.jpeg     754
10-anomaly-detection/1-overview                       751
3-sql/1-mysql-overview                                707
1-fundamentals

#### full stack php

In [18]:
# keep only the paths with more than 250 hits
fsp_test = fsp.groupby('path').filter(lambda x: len(x) > 250)
fsp_test.path.value_counts()

/                   1681
index.html          1011
javascript-i         736
html-css             542
spring               501
java-iii             479
java-ii              454
java-i               444
javascript-ii        429
appendix             409
jquery               344
mysql                284
content/html-css     262
Name: path, dtype: int64

#### front end programming

In [19]:
fep.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 31627 to 31631
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   path        5 non-null      object        
 1   user_id     5 non-null      int64         
 2   cohort_id   5 non-null      float64       
 3   ip          5 non-null      object        
 4   name        5 non-null      object        
 5   start_date  5 non-null      datetime64[ns]
 6   end_date    5 non-null      datetime64[ns]
 7   created_at  5 non-null      datetime64[ns]
 8   updated_at  5 non-null      datetime64[ns]
 9   program_id  5 non-null      object        
 10  date_time   5 non-null      datetime64[ns]
dtypes: datetime64[ns](5), float64(1), int64(1), object(4)
memory usage: 480.0+ bytes


In [20]:
fep.path.value_counts()

content/html-css                               2
/                                              1
content/html-css/gitbook/images/favicon.ico    1
content/html-css/introduction.html             1
Name: path, dtype: int64

##### Front end programming, full stack PHP, and full stack Java all have html-css among their most visited paths. It does not appear in the data science list above, so let's check whether data science cohorts visit html-css in any significant number.

In [21]:
ds1 = ds[ds.path == 'html-css']
ds1.path.value_counts()

html-css    16
Name: path, dtype: int64
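
A raw count of 16 is easier to judge next to the other programs and relative to each program's total traffic. One way to sketch that comparison, on toy data (`summary` is a hypothetical name and the numbers are illustrative only):

```python
import pandas as pd

logs = pd.DataFrame({
    "program_id": ["full_stack_java"] * 4 + ["data_science"] * 4,
    "path": ["html-css", "html-css", "spring", "/",
             "html-css", "/", "sql/mysql-overview", "/"],
})

# True where the request hit the html-css lesson
is_html_css = logs["path"].eq("html-css")

# per program: number of html-css hits and their share of all hits
summary = is_html_css.groupby(logs["program_id"]).agg(hits="sum", share="mean")
```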

### Takeaway:

Front end programming only has html-css lessons, and there are html-css lessons for each program across all cohorts, so it is simple enough to say that html-css is a lesson with consistent traffic across all programs.

However, the front end programming program has an extremely small sample size (5 rows), so I think it should be removed. If it is removed, SQL would have the most traffic consistently across all cohorts per program among the remaining programs: mysql appears in both full stack lists above, and sql/mysql-overview appears in the data science list.

_Pending Question..._

Should cohorts also be controlled for? (Control for cohorts -> Program)
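
If cohorts are controlled for, one possible approach is to rank lessons inside each cohort and then count how many cohorts of a program place a lesson in their top k. The sketch below uses toy data; `top_k` and the variable names are assumptions for illustration.

```python
import pandas as pd

logs = pd.DataFrame({
    "program_id": ["full_stack_java"] * 8,
    "name": ["teddy"] * 4 + ["oberon"] * 4,
    "path": ["javascript-i", "javascript-i", "javascript-i", "spring",
             "javascript-i", "javascript-i", "javascript-i", "mysql"],
})

# hits per lesson within each cohort
hits = (
    logs.groupby(["program_id", "name", "path"])
        .size()
        .rename("hits")
        .reset_index()
)

# rank lessons inside each cohort; rank 1 is the most visited, ties share the best rank
hits["rank"] = hits.groupby(["program_id", "name"])["hits"].rank(method="min", ascending=False)

top_k = 1  # hypothetical cutoff
in_top = hits[hits["rank"] <= top_k]

# number of cohorts in which each lesson reaches the top k, per program
cohorts_in_top = in_top.groupby(["program_id", "path"])["name"].nunique()
```

A lesson whose count equals the number of cohorts in its program is in the top k for every cohort, which is one reading of "consistently across cohorts".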