# Questions for Board Meeting
### MVP 
1. Which lesson appears to attract the most traffic consistently across cohorts (per program)?
7. Which lessons are least accessed?
2. Is there a cohort that referred to a lesson significantly more than other cohorts, which seemed to gloss over it?
6. What topics are grads continuing to reference after graduation and into their jobs (for each program)?
5. At some point in 2019, the ability for students and alumni to access both curriculums (web dev to ds, ds to web dev) should have been shut off. Do you see any evidence of that happening? Did it happen before?

### If there's time
4. Is there any suspicious activity, such as users/machines/etc accessing the curriculum who shouldn’t be? Does it appear that any web-scraping is happening? Are there any suspicious IP addresses?
3. Are there students who, when active, hardly access the curriculum? If so, what information do you have about these students?

In [33]:
import warnings
warnings.filterwarnings("ignore")

import os
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

# DBSCAN import
from sklearn.cluster import DBSCAN

# Scaler import
from sklearn.preprocessing import MinMaxScaler

from env import host, user, password

## Acquire

In [36]:
url = f'mysql+pymysql://{user}:{password}@{host}/curriculum_logs'

In [37]:
sql_query = '''
SELECT date, time, path, user_id, cohort_id, program_id, ip, name, slack, start_date, end_date, created_at, updated_at
FROM logs
JOIN cohorts on logs.cohort_id = cohorts.id
'''

In [40]:
if os.path.isfile('logs.csv'):

    # If csv file exists, read in data from csv file.
    df = pd.read_csv('logs.csv', index_col=0)

else:

    # Read fresh data from db into a DataFrame.
    df = pd.read_sql(sql_query, url)
    
    # Write DataFrame to a csv
    df.to_csv('logs.csv')

In [41]:
df.head()

Unnamed: 0,date,time,path,user_id,cohort_id,program_id,ip,name,slack,start_date,end_date,created_at,updated_at
0,2018-01-26,09:55:03,/,1,8.0,1,97.105.19.61,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26
1,2018-01-26,09:56:02,java-ii,1,8.0,1,97.105.19.61,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26
2,2018-01-26,09:56:05,java-ii/object-oriented-programming,1,8.0,1,97.105.19.61,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26
3,2018-01-26,09:56:06,slides/object_oriented_programming,1,8.0,1,97.105.19.61,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26
4,2018-01-26,09:56:24,javascript-i/conditionals,2,22.0,2,97.105.19.61,Teddy,#teddy,2018-01-08,2018-05-17,2018-01-08 13:59:10,2018-01-08 13:59:10


In [42]:
df.shape

(847330, 13)

## Prepare

In [43]:
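# Label each row with its program: ids 1, 2, and 4 map to web_dev, 3 to data_science
conditions = [df.program_id == 1, df.program_id == 2, df.program_id == 3, df.program_id == 4]
result = ['web_dev','web_dev','data_science','web_dev']
# A string default keeps the output dtype consistent if an id falls outside 1-4
df['program'] = np.select(conditions, result, default='unknown')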
df.head()

Unnamed: 0,date,time,path,user_id,cohort_id,program_id,ip,name,slack,start_date,end_date,created_at,updated_at,program
0,2018-01-26,09:55:03,/,1,8.0,1,97.105.19.61,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,web_dev
1,2018-01-26,09:56:02,java-ii,1,8.0,1,97.105.19.61,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,web_dev
2,2018-01-26,09:56:05,java-ii/object-oriented-programming,1,8.0,1,97.105.19.61,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,web_dev
3,2018-01-26,09:56:06,slides/object_oriented_programming,1,8.0,1,97.105.19.61,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,web_dev
4,2018-01-26,09:56:24,javascript-i/conditionals,2,22.0,2,97.105.19.61,Teddy,#teddy,2018-01-08,2018-05-17,2018-01-08 13:59:10,2018-01-08 13:59:10,web_dev


## 1. Which lesson appears to attract the most traffic consistently across cohorts (per program)?
- Break each path down into program and lesson: filter to the paths that belong to a program, then identify the lesson within each path
    - group by path and count
    - assign each path a label: if path == ??? then the column label is lesson
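The plan above could be sketched roughly as follows. This is a minimal illustration on made-up rows, not the labeling rule the notebook settles on: the `module`, `lesson`, and `is_lesson` columns, the `label_lessons` helper, and the rule "a lesson is a path with at least two segments and no file extension" are all assumptions introduced here.

```python
import pandas as pd

# Toy rows standing in for the real `df`; paths are taken from the outputs shown in this notebook.
logs = pd.DataFrame({
    "program": ["web_dev", "web_dev", "web_dev",
                "data_science", "data_science", "data_science"],
    "path": ["/", "java-ii/object-oriented-programming", "javascript-i/conditionals",
             "classification/overview", "classification/scale_features_or_not.svg",
             "sql/mysql-overview"],
})

def label_lessons(frame):
    """Add hypothetical `module`, `lesson`, and `is_lesson` columns derived from `path`."""
    out = frame.copy()
    path = out["path"].fillna("")
    parts = path.str.strip("/").str.split("/")
    out["module"] = parts.str[0]
    out["lesson"] = parts.str[1]  # NaN when the path has a single segment
    # Assumption: images and JSON files are page assets, not lessons.
    is_asset = path.str.contains(r"\.(?:jpg|jpeg|png|svg|json|gif)$", case=False, regex=True)
    out["is_lesson"] = out["lesson"].notna() & ~is_asset
    return out

labeled = label_lessons(logs)

# Group by program and path, counting only the rows labeled as lessons.
lesson_hits = (
    labeled[labeled["is_lesson"]]
    .groupby(["program", "path"])
    .size()
    .sort_values(ascending=False)
)
print(lesson_hits)
```

Filtering on `is_lesson` before counting keeps the landing page (`/`) and image requests from crowding the top of the ranking.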

In [74]:
df_ds = df[df.program=='data_science']
df_ds.head()

Unnamed: 0,date,time,path,user_id,cohort_id,program_id,ip,name,slack,start_date,end_date,created_at,updated_at,program
300225,2019-08-20,09:39:58,/,466,34.0,3,97.105.19.58,Bayes,#,2019-08-19,2020-01-30,2019-08-20 14:38:55,2019-08-20 14:38:55,data_science
300226,2019-08-20,09:39:59,/,467,34.0,3,97.105.19.58,Bayes,#,2019-08-19,2020-01-30,2019-08-20 14:38:55,2019-08-20 14:38:55,data_science
300227,2019-08-20,09:39:59,/,468,34.0,3,97.105.19.58,Bayes,#,2019-08-19,2020-01-30,2019-08-20 14:38:55,2019-08-20 14:38:55,data_science
300228,2019-08-20,09:40:02,/,469,34.0,3,97.105.19.58,Bayes,#,2019-08-19,2020-01-30,2019-08-20 14:38:55,2019-08-20 14:38:55,data_science
300229,2019-08-20,09:40:08,/,470,34.0,3,97.105.19.58,Bayes,#,2019-08-19,2020-01-30,2019-08-20 14:38:55,2019-08-20 14:38:55,data_science


In [92]:
df_ds.groupby(by=['path'])[['user_id']].agg('count').sort_values('user_id', ascending=False)[:40]

Unnamed: 0_level_0,user_id
path,Unnamed: 1_level_1
/,8358
search/search_index.json,2203
classification/overview,1785
1-fundamentals/modern-data-scientist.jpg,1655
1-fundamentals/AI-ML-DL-timeline.jpg,1651
1-fundamentals/1.1-intro-to-data-science,1633
classification/scale_features_or_not.svg,1590
fundamentals/AI-ML-DL-timeline.jpg,1443
fundamentals/modern-data-scientist.jpg,1438
sql/mysql-overview,1424


In [75]:
df_web = df[df.program=='web_dev']
df_web.head()

Unnamed: 0,date,time,path,user_id,cohort_id,program_id,ip,name,slack,start_date,end_date,created_at,updated_at,program
0,2018-01-26,09:55:03,/,1,8.0,1,97.105.19.61,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,web_dev
1,2018-01-26,09:56:02,java-ii,1,8.0,1,97.105.19.61,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,web_dev
2,2018-01-26,09:56:05,java-ii/object-oriented-programming,1,8.0,1,97.105.19.61,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,web_dev
3,2018-01-26,09:56:06,slides/object_oriented_programming,1,8.0,1,97.105.19.61,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,web_dev
4,2018-01-26,09:56:24,javascript-i/conditionals,2,22.0,2,97.105.19.61,Teddy,#teddy,2018-01-08,2018-05-17,2018-01-08 13:59:10,2018-01-08 13:59:10,web_dev


In [72]:
df.groupby(by=['program', 'path'])['user_id'].agg('count')

program       path                            
data_science  %20https://github.com/RaulCPena        1
              ,%20https://github.com/RaulCPena       1
              .gitignore                             1
              /                                   8358
              1-fundamentals                        10
                                                  ... 
web_dev       web-design/ux/purpose                255
              web-dev-day-two                        2
              working-with-time-series-data          1
              wp-admin                               1
              wp-login                               1
Name: user_id, Length: 2736, dtype: int64

In [64]:
df.groupby(by=['path'])[['program','name']].agg('count').sort_values(['program','name'], ascending=False)[1:20]

Unnamed: 0_level_0,program,name
path,Unnamed: 1_level_1,Unnamed: 2_level_1
javascript-i,18203,18203
toc,17591,17591
search/search_index.json,17534,17534
java-iii,13166,13166
html-css,13127,13127
java-ii,12177,12177
spring,11883,11883
jquery,11041,11041
mysql,10611,10611
java-i,10467,10467


## Code from Bonus: Identify users who are viewing both the web dev and data science curricula

In [None]:
# find data science students that have logs for web dev pages
# subset df to data_science only
df_data_science = df[df.program=='data_science']
df_data_science.head()

In [None]:
# find data science student paths that contain java or html
df_data_science.path.str.contains(pat='html|java', case=False, regex=True).sum() # Too few to continue

In [None]:
# make list of data science paths
ds_endpoints = df_data_science.path.unique()
ds_endpoints = pd.Series(ds_endpoints)
ds_endpoints

In [None]:
ds_endpoints.str.contains('java|html', case=False, regex=True)

In [None]:
# find web dev students whose paths match data science paths
df_web_dev = df[df.program=='web_dev']
df_web_dev.head()

In [None]:
df_web_dev[df_web_dev.path.isin(ds_endpoints)] # 234K observation of web dev have