### Questions for Board Meeting
### MVP
- 1. Which lesson appears to attract the most traffic consistently across cohorts (per program)?
- 2. Which lessons are least accessed?
- 3. Is there a cohort that referred to a lesson significantly more than other cohorts, which seemed to gloss over it?
- 4. What topics are grads continuing to reference after graduation and into their jobs (for each program)?
- 5. At some point in 2019, the ability for students and alumni to access both curriculums (web dev to ds, ds to web dev) should have been shut off. Do you see any evidence of that happening? Did it happen before?

### Follow-up questions

- 1. Is there any suspicious activity, such as users/machines/etc accessing the curriculum who shouldn’t be? Does it appear that any web-scraping is happening? Are there any suspicious IP addresses?
- 2. Are there students who, when active, hardly access the curriculum? If so, what information do you have about these students?

In [1]:
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt

import numpy as np
import pandas as pd
import seaborn as sns
import env
import os

from env import host, user, password

# DBSCAN import
from sklearn.cluster import DBSCAN

# Scaler import
from sklearn.preprocessing import MinMaxScaler

import acquire
import prepare

In [2]:
#acquire sql query
df = acquire.acquire()

In [3]:
# see prepare for cleaning
df = prepare.clean_cohort(df)

In [4]:
# splits our df into a webdev and data science
df_wd, df_ds = prepare.program_split(df)

In [5]:
df_wd['url'].value_counts()

javascript-i                            107525
html-css                                 76142
mysql                                    73732
jquery                                   54080
spring                                   49907
                                         ...  
10.03_Explore                                1
10.04.01_FeatureExtraction_FreqBased         1
10.04.02_FeatureExtraction_Word2Vec          1
10.04.03_SentimentAnalysis                   1
selectors                                    1
Name: url, Length: 360, dtype: int64

In [6]:
wd_hits = df_wd['url'].value_counts().loc[lambda x : x>100]

In [7]:
# pulled from Kan - will throw into prepare
df_ds['lesson'] = np.where(df_ds.path.str.contains('appendix'), 'appendix',
np.where(df_ds.path.str.contains('search'), 'search',
np.where(df_ds.path.str.contains('classification'),'classification',
np.where(df_ds.path.str.contains('sql'), 'sql',
np.where(df_ds.path.str.contains('fundamentals'), 'fundamentals',
np.where(df_ds.path.str.contains('regression'), 'regression',
np.where(df_ds.path.str.contains('python'), 'python',
np.where(df_ds.path.str.contains('stats'), 'stats', 
np.where(df_ds.path.str.contains('anomaly'), 'anomaly',
np.where(df_ds.path.str.contains('clustering'), 'clustering',
np.where(df_ds.path.str.contains('nlp'), 'nlp',
np.where(df_ds.path.str.contains('timeseries'), 'time_series',
np.where(df_ds.path.str.contains('distributed-ml'), 'distributed_ml',
np.where(df_ds.path.str.contains('storytelling'), 'storytelling',

np.where(df_ds.path.str.contains('advanced-topics'), 'advanced-topics',
np.where(df_ds.path.str.contains('capstones'), 'capstones',
'pending'))))))))))))))))
df_ds.lesson.unique()

array(['pending', 'sql', 'storytelling', 'appendix', 'fundamentals',
       'search', 'advanced-topics', 'regression', 'anomaly', 'nlp',
       'classification', 'clustering', 'time_series', 'stats', 'python',
       'distributed_ml', 'capstones'], dtype=object)
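
The nested `np.where` chain above is hard to extend. When this logic moves into `prepare`, one possible way to express the same first-match-wins labelling is `np.select` with an ordered list of keywords. This is only a sketch: `label_lessons`, `DS_RULES`, and the small sample frame are hypothetical names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Ordered (keyword, label) pairs copied from the chain above; the first match wins.
DS_RULES = [
    ("appendix", "appendix"),
    ("search", "search"),
    ("classification", "classification"),
    ("sql", "sql"),
    ("fundamentals", "fundamentals"),
    ("regression", "regression"),
    ("python", "python"),
    ("stats", "stats"),
    ("anomaly", "anomaly"),
    ("clustering", "clustering"),
    ("nlp", "nlp"),
    ("timeseries", "time_series"),
    ("distributed-ml", "distributed_ml"),
    ("storytelling", "storytelling"),
    ("advanced-topics", "advanced-topics"),
    ("capstones", "capstones"),
]


def label_lessons(paths, rules, default="pending"):
    """Label each path with the first rule whose keyword occurs in it."""
    paths = paths.fillna("").astype(str)
    # One boolean mask per rule, in priority order (plain substring match).
    conditions = [
        paths.str.contains(keyword, regex=False).to_numpy(dtype=bool)
        for keyword, _ in rules
    ]
    choices = [label for _, label in rules]
    return np.select(conditions, choices, default=default)


# Hypothetical sample rows, for illustration only.
sample = pd.DataFrame(
    {"path": ["/", "sql/mysql-overview", "appendix/sql-notes", "timeseries/overview"]}
)
sample["lesson"] = label_lessons(sample["path"], DS_RULES)
```

Because `np.select` takes the first true condition, the order of `DS_RULES` matters in the same way the nesting order does in the original chain.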

In [8]:
# pulled from Kan - will throw into prepare
df_ds['url2'] = np.where(df_ds.url.str.contains('appendix'), 'appendix',
np.where(df_ds.url.str.contains('search'), 'search',
np.where(df_ds.url.str.contains('classification'),'classification',
np.where(df_ds.url.str.contains('sql'), 'sql',
np.where(df_ds.url.str.contains('fundamentals'), 'fundamentals',
np.where(df_ds.url.str.contains('regression'), 'regression',
np.where(df_ds.url.str.contains('python'), 'python',
np.where(df_ds.url.str.contains('stats'), 'stats', 
np.where(df_ds.url.str.contains('anomaly'), 'anomaly',
np.where(df_ds.url.str.contains('clustering'), 'clustering',
np.where(df_ds.url.str.contains('nlp'), 'nlp',
np.where(df_ds.url.str.contains('timeseries'), 'time_series',
np.where(df_ds.url.str.contains('distributed-ml'), 'distributed_ml',
np.where(df_ds.url.str.contains('storytelling'), 'storytelling',

np.where(df_ds.url.str.contains('advanced-topics'), 'advanced-topics',
np.where(df_ds.url.str.contains('capstones'), 'capstones',
'pending'))))))))))))))))
df_ds.url2.unique()

array(['pending', 'sql', 'storytelling', 'appendix', 'fundamentals',
       'search', 'advanced-topics', 'regression', 'anomaly', 'nlp',
       'classification', 'clustering', 'time_series', 'stats', 'python',
       'distributed_ml', 'capstones'], dtype=object)

In [9]:
df_ds[['url', 'url2']].value_counts()

url                          url2          
fundamentals                 fundamentals      8746
classification               classification    8620
                             pending           8358
1-fundamentals               fundamentals      7945
sql                          sql               7505
                                               ... 
index.html                   pending              1
imports                      pending              1
group-by                     pending              1
7.4.2-series                 pending              1
5-detecting-with-clustering  clustering           1
Length: 153, dtype: int64
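
A quick way to see which `url` values still fall through to `pending` might look like the sketch below. It assumes the `url` and `url2` column names used above; `top_unlabelled` and the sample rows are hypothetical.

```python
import pandas as pd


def top_unlabelled(df, label_col="url2", value_col="url", n=5):
    """Return the most frequent raw values that were labelled 'pending'."""
    pending = df[df[label_col] == "pending"]
    return pending[value_col].value_counts().head(n)


# Hypothetical sample rows, for illustration only.
sample = pd.DataFrame(
    {
        "url": ["", "", "index.html", "sql", "imports", ""],
        "url2": ["pending", "pending", "pending", "sql", "pending", "pending"],
    }
)
unlabelled = top_unlabelled(sample)
```

Values that show up here with high counts are candidates for a new keyword rule.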

In [10]:
# created lesson column with values for web dev lessons
df_wd['lesson'] = np.where(df_wd.path.str.contains('search'),'search',
np.where(df_wd.path.str.contains('index'),'index',
np.where(df_wd.path.str.contains('javascript'),'javascript',
np.where(df_wd.path.str.contains('toc'),'toc',
np.where(df_wd.path.str.contains('java'),'java',
np.where(df_wd.path.str.contains('html|css'),'html-css',
np.where(df_wd.path.str.contains('spring'),'spring',
np.where(df_wd.path.str.contains('jquery'),'jquery',
np.where(df_wd.path.str.contains('mysql'),'mysql',
np.where(df_wd.path.str.contains('capstone'),'capstone',
np.where(df_wd.path.str.contains('array|syntax|object_oriented|polymorph|methods|collections|deployment'),'structure',
np.where(df_wd.path.str.contains('php'),'php',
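# NOTE: 'larvel' is likely a typo for 'laravel'; as written it matches no path (no such label appears in the output below)
np.where(df_wd.path.str.contains('larvel'),'larvel',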
         
'pending')))))))))))))
                                                                                                                                                
df_wd.lesson.unique()

array(['pending', 'java', 'structure', 'javascript', 'search', 'spring',
       'capstone', 'index', 'html-css', 'mysql', 'jquery', 'php', 'toc'],
      dtype=object)

In [11]:
##### so with this our data should be set

### 4. What topics are grads continuing to reference after graduation and into their jobs (for each program)?

In [12]:
### datetime - check timestamps min, max - make dfs of hits that come after end_date -
### compare those dates with value_counts() and look for anything interesting

In [13]:
# convert end_date to datetime (use each frame's own column rather than the parent df)
df_wd['end_date'] = pd.to_datetime(df_wd['end_date'])
df_ds['end_date'] = pd.to_datetime(df_ds['end_date'])

In [14]:
#checking timeframe for web_dev
df_wd.end_date.min(), df_wd.end_date.max()

(Timestamp('2014-04-22 00:00:00'), Timestamp('2021-10-01 00:00:00'))

In [15]:
#checking timeframe for datascience
df_ds.end_date.min(), df_ds.end_date.max()

(Timestamp('2020-01-30 00:00:00'), Timestamp('2021-09-03 00:00:00'))

In [16]:
# add to prepare.py
df_wd['datetime'] = df_wd.date
df_wd['datetime'] = pd.to_datetime(df_wd['datetime'])

In [17]:
df_ds['datetime'] = df_ds.date
df_ds['datetime'] = pd.to_datetime(df_ds['datetime'])

In [18]:
#making a dataframe for each 
wd_diff_time = df_wd[df_wd.datetime > df_wd.end_date][['path', 'user_id', 'end_date', 'datetime', 'url', 'lesson']]
ds_diff_time = df_ds[df_ds.datetime > df_ds.end_date][['path', 'user_id', 'end_date', 'datetime', 'url', 'lesson']]

In [19]:
pd.DataFrame(ds_diff_time['path'].value_counts(ascending = False)).head(10)


Unnamed: 0,path
/,1436
search/search_index.json,493
sql/mysql-overview,275
classification/overview,266
classification/scale_features_or_not.svg,219
anomaly-detection/AnomalyDetectionCartoon.jpeg,193
anomaly-detection/overview,191
fundamentals/AI-ML-DL-timeline.jpg,189
fundamentals/modern-data-scientist.jpg,187
fundamentals/intro-to-data-science,184


In [20]:
# top lessons Data Science alumni reference
pd.DataFrame(ds_diff_time['lesson'].value_counts(ascending = False)).head(4)


Unnamed: 0,lesson
sql,1555
fundamentals,1514
pending,1472
classification,1312


In [21]:
# top lessons Web Dev alumni reference
pd.DataFrame(wd_diff_time['lesson'].value_counts(ascending = False)).head(4)


Unnamed: 0,lesson
java,20304
javascript,19794
pending,15290
html-css,13914
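
The rankings above still include `pending` rows, and the path table includes static assets (`.jpg`, `.svg`, `search/search_index.json`). A minimal sketch of one way to rank only labelled lesson pages after graduation follows. It assumes the column names used in `ds_diff_time`; `top_alumni_lessons`, `ASSET_PATTERN`, and the sample rows are hypothetical.

```python
import pandas as pd

# Assumption: these extensions mark static assets rather than lesson pages.
ASSET_PATTERN = r"\.(?:jpe?g|png|svg|json|ico)$"


def top_alumni_lessons(df, n=3):
    """Count post-graduation hits per lesson, skipping assets and 'pending' rows."""
    after_grad = df[df["datetime"] > df["end_date"]]
    is_asset = after_grad["path"].str.contains(ASSET_PATTERN, case=False, regex=True)
    labelled = after_grad[~is_asset & (after_grad["lesson"] != "pending")]
    return labelled["lesson"].value_counts().head(n)


# Hypothetical sample rows, for illustration only.
sample = pd.DataFrame(
    {
        "path": [
            "sql/mysql-overview",
            "fundamentals/AI-ML-DL-timeline.jpg",
            "/",
            "sql/joins",
            "classification/overview",
            "sql/basics",
        ],
        "lesson": ["sql", "fundamentals", "pending", "sql", "classification", "sql"],
        "datetime": pd.to_datetime(
            ["2020-03-01", "2020-03-01", "2020-03-02", "2020-03-05", "2020-03-06", "2020-01-10"]
        ),
        "end_date": pd.to_datetime(["2020-01-30"] * 6),
    }
)
result = top_alumni_lessons(sample)
```

Whether image requests should count as lesson views is a judgment call; dropping them keeps a single page load from being counted several times.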


### 5. At some point in 2019, the ability for students and alumni to access both curriculums (web dev to ds, ds to web dev) should have been shut off. Do you see any evidence of that happening? Did it happen before?

In [22]:
df.head()

Unnamed: 0,date,time,path,user_id,cohort_id,program_id,ip,name,start_date,end_date,created_at,program,url,subpath,counter
0,2018-01-26,09:55:03,/,1,8.0,1,97.105.19.61,Hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,web_dev,,"[, ]",1
1,2018-01-26,09:56:02,java-ii,1,8.0,1,97.105.19.61,Hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,web_dev,java-ii,[java-ii],1
2,2018-01-26,09:56:05,java-ii/object-oriented-programming,1,8.0,1,97.105.19.61,Hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,web_dev,java-ii,"[java-ii, object-oriented-programming]",1
3,2018-01-26,09:56:06,slides/object_oriented_programming,1,8.0,1,97.105.19.61,Hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,web_dev,slides,"[slides, object_oriented_programming]",1
4,2018-01-26,09:56:24,javascript-i/conditionals,2,22.0,2,97.105.19.61,Teddy,2018-01-08,2018-05-17,2018-01-08 13:59:10,web_dev,javascript-i,"[javascript-i, conditionals]",1


In [23]:
# set datetime
df_ds.date = pd.to_datetime(df_ds.date)


In [24]:
#set index
df_ds =df_ds.set_index('date')

In [25]:
# select the 2019 rows
ds_2019 = df_ds.loc['2019']

#20,000 values
## next how do we search? maybe by top hits?

In [26]:
# why do we only have Bayes?
ds_2019['name'].value_counts()

Bayes    20068
Name: name, dtype: int64

In [27]:
# okay, so we don't have Ada (didn't notice until now); Bayes was the 2019 cohort, which is why it is the only one showing
df_ds['name'].value_counts()

Darden      32015
Bayes       26538
Curie       21581
Easley      14715
Florence     8562
Name: name, dtype: int64

In [28]:
ds_2019.head()

# java	20304
# javascript	19794
# pending	15290
# html-css	13914

date,time,path,user_id,cohort_id,program_id,ip,name,start_date,end_date,created_at,program,url,subpath,counter,lesson,url2,datetime
2019-08-20,09:39:58,/,466,34.0,3,97.105.19.58,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,,"[, ]",1,pending,pending,2019-08-20
2019-08-20,09:39:59,/,467,34.0,3,97.105.19.58,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,,"[, ]",1,pending,pending,2019-08-20
2019-08-20,09:39:59,/,468,34.0,3,97.105.19.58,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,,"[, ]",1,pending,pending,2019-08-20
2019-08-20,09:40:02,/,469,34.0,3,97.105.19.58,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,,"[, ]",1,pending,pending,2019-08-20
2019-08-20,09:40:08,/,470,34.0,3,97.105.19.58,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,,"[, ]",1,pending,pending,2019-08-20


In [29]:
#don't know if this solves issues
ds_2019 = ds_2019.reset_index()

In [30]:
# I don't see anything for the data science students looking at webdev curriculum in 2019
ds_2019.lesson.value_counts()


fundamentals       2558
python             2183
regression         2106
sql                1913
stats              1678
classification     1557
pending            1456
clustering         1115
appendix           1105
nlp                 951
time_series         948
anomaly             910
storytelling        757
distributed_ml      428
search              338
advanced-topics      64
capstones             1
Name: lesson, dtype: int64

In [31]:
#look at web_dev
#following previous code but with wd
df_wd.date = pd.to_datetime(df_wd.date)

df_wd =df_wd.set_index('date')

wd_2019 = df_wd.loc['2019']


In [32]:
# a lot bigger list than ds
wd_2019['name'].value_counts()

Ceres         36472
Zion          36290
Betelgeuse    27220
Andromeda     22768
Deimos        16927
Yosemite      11130
Europa         8735
Xanadu         7828
Lassen         3168
Olympic        3074
Ulysses        1875
Voyageurs      1533
Wrangell       1003
Sequoia         939
Teddy           922
Kings           596
Quincy          289
Pinnacles       216
Glacier         186
Arches          148
Ike             125
Mammoth         111
Niagara          66
Joshua           40
Badlands         16
Franklin         14
Name: name, dtype: int64

In [33]:
# top hits from DS alumni - starting point
# sql	1555
# fundamentals	1514
# pending	1472
# classification	1312

In [34]:
wd_2019.lesson.value_counts()
# I see mysql hits and potentially 'pending' but it applies to both

javascript    42064
java          36860
html-css      29070
mysql         20217
spring        14644
jquery        13823
pending       13802
toc            4730
search         3897
index          1149
capstone        873
php             471
structure        91
Name: lesson, dtype: int64

In [35]:
wd_2019_sql = wd_2019[wd_2019.lesson == 'mysql']
wd_2019_sql.tail(40)

date,time,path,user_id,cohort_id,program_id,ip,name,start_date,end_date,created_at,program,url,subpath,counter,lesson,datetime
2019-12-28,10:43:43,mysql/clauses/limit,488,51.0,2,99.100.172.39,Deimos,2019-09-16,2020-02-27,2019-09-16 13:07:04,web_dev,mysql,"[mysql, clauses, limit]",1,mysql,2019-12-28
2019-12-28,14:01:44,mysql,456,33.0,2,107.136.68.158,Ceres,2019-07-15,2019-12-11,2019-07-15 16:57:21,web_dev,mysql,[mysql],1,mysql,2019-12-28
2019-12-28,14:01:47,mysql/introduction,456,33.0,2,107.136.68.158,Ceres,2019-07-15,2019-12-11,2019-07-15 16:57:21,web_dev,mysql,"[mysql, introduction]",1,mysql,2019-12-28
2019-12-28,15:05:46,mysql,510,51.0,2,24.243.7.150,Deimos,2019-09-16,2020-02-27,2019-09-16 13:07:04,web_dev,mysql,[mysql],1,mysql,2019-12-28
2019-12-28,15:07:42,mysql/relationships,510,51.0,2,24.243.7.150,Deimos,2019-09-16,2020-02-27,2019-09-16 13:07:04,web_dev,mysql,"[mysql, relationships]",1,mysql,2019-12-28
2019-12-28,15:07:51,mysql/relationships/joins,510,51.0,2,24.243.7.150,Deimos,2019-09-16,2020-02-27,2019-09-16 13:07:04,web_dev,mysql,"[mysql, relationships, joins]",1,mysql,2019-12-28
2019-12-28,16:44:11,mysql,433,33.0,2,70.94.181.28,Ceres,2019-07-15,2019-12-11,2019-07-15 16:57:21,web_dev,mysql,[mysql],1,mysql,2019-12-28
2019-12-30,08:11:54,mysql//intellij,505,51.0,2,45.21.32.233,Deimos,2019-09-16,2020-02-27,2019-09-16 13:07:04,web_dev,mysql,"[mysql, , intellij]",1,mysql,2019-12-30
2019-12-30,08:11:55,mysql/img/favicon.ico,505,51.0,2,45.21.32.233,Deimos,2019-09-16,2020-02-27,2019-09-16 13:07:04,web_dev,mysql,"[mysql, img, favicon.ico]",1,mysql,2019-12-30
2019-12-30,08:12:12,mysql//extra-exercises,505,51.0,2,45.21.32.233,Deimos,2019-09-16,2020-02-27,2019-09-16 13:07:04,web_dev,mysql,"[mysql, , extra-exercises]",1,mysql,2019-12-30


In [36]:
# I see hits all the way to the last day of 2019 for the sql material
# let's double check whether it is part of the web dev curriculum

In [37]:
df_wd.lesson.value_counts()

javascript    149227
java          138797
html-css       92725
mysql          71913
pending        58516
spring         52344
jquery         50290
toc            16669
search         16000
capstone        4834
index           3587
structure       2900
php             2085
Name: lesson, dtype: int64

In [38]:
# the lesson column might not be giving useful results, let's try url

In [39]:
wd_2019.url.value_counts().head(25)
### 

javascript-i               30618
html-css                   24649
mysql                      20779
jquery                     14875
spring                     14115
java-ii                    13835
java-iii                   11978
java-i                     10401
javascript-ii               9726
                            7683
appendix                    5896
toc                         4730
examples                    3983
search                      3897
content                     1745
web-design                  1010
index.html                   427
prework                      179
1-fundamentals               158
slides                       141
assets                        79
1._Fundamentals               49
4-python                      49
3.0-mysql-overview            35
study-session-with-ryan       26
Name: url, dtype: int64

In [40]:
#12-22
wd_2019_url = wd_2019[wd_2019.url == '1-fundamentals']
wd_2019_url.tail(40)

date,time,path,user_id,cohort_id,program_id,ip,name,start_date,end_date,created_at,program,url,subpath,counter,lesson,datetime
2019-10-15,13:56:32,1-fundamentals/AI-ML-DL-timeline.jpg,450,33.0,2,97.105.19.58,Ceres,2019-07-15,2019-12-11,2019-07-15 16:57:21,web_dev,1-fundamentals,"[1-fundamentals, AI-ML-DL-timeline.jpg]",1,pending,2019-10-15
2019-10-17,15:58:40,1-fundamentals/1.1-intro-to-data-science,451,33.0,2,97.105.19.58,Ceres,2019-07-15,2019-12-11,2019-07-15 16:57:21,web_dev,1-fundamentals,"[1-fundamentals, 1.1-intro-to-data-science]",1,pending,2019-10-17
2019-10-17,15:58:40,1-fundamentals/modern-data-scientist.jpg,451,33.0,2,97.105.19.58,Ceres,2019-07-15,2019-12-11,2019-07-15 16:57:21,web_dev,1-fundamentals,"[1-fundamentals, modern-data-scientist.jpg]",1,pending,2019-10-17
2019-10-17,15:58:40,1-fundamentals/AI-ML-DL-timeline.jpg,451,33.0,2,97.105.19.58,Ceres,2019-07-15,2019-12-11,2019-07-15 16:57:21,web_dev,1-fundamentals,"[1-fundamentals, AI-ML-DL-timeline.jpg]",1,pending,2019-10-17
2019-10-25,01:46:43,1-fundamentals/1.1-intro-to-data-science,513,7.0,1,173.239.232.172,Glacier,2015-06-05,2015-10-06,2016-06-14 19:52:26,web_dev,1-fundamentals,"[1-fundamentals, 1.1-intro-to-data-science]",1,pending,2019-10-25
2019-10-25,01:46:46,1-fundamentals/modern-data-scientist.jpg,513,7.0,1,173.239.232.172,Glacier,2015-06-05,2015-10-06,2016-06-14 19:52:26,web_dev,1-fundamentals,"[1-fundamentals, modern-data-scientist.jpg]",1,pending,2019-10-25
2019-10-25,01:46:47,1-fundamentals/AI-ML-DL-timeline.jpg,513,7.0,1,173.239.232.172,Glacier,2015-06-05,2015-10-06,2016-06-14 19:52:26,web_dev,1-fundamentals,"[1-fundamentals, AI-ML-DL-timeline.jpg]",1,pending,2019-10-25
2019-10-25,12:58:53,1-fundamentals/1.1-intro-to-data-science,513,7.0,1,74.81.88.18,Glacier,2015-06-05,2015-10-06,2016-06-14 19:52:26,web_dev,1-fundamentals,"[1-fundamentals, 1.1-intro-to-data-science]",1,pending,2019-10-25
2019-10-25,12:58:54,1-fundamentals/modern-data-scientist.jpg,513,7.0,1,74.81.88.18,Glacier,2015-06-05,2015-10-06,2016-06-14 19:52:26,web_dev,1-fundamentals,"[1-fundamentals, modern-data-scientist.jpg]",1,pending,2019-10-25
2019-10-25,12:58:54,1-fundamentals/AI-ML-DL-timeline.jpg,513,7.0,1,74.81.88.18,Glacier,2015-06-05,2015-10-06,2016-06-14 19:52:26,web_dev,1-fundamentals,"[1-fundamentals, AI-ML-DL-timeline.jpg]",1,pending,2019-10-25


In [41]:
#12-22
wd_2019_url = wd_2019[wd_2019.url == '4-python']
wd_2019_url.tail(40)

date,time,path,user_id,cohort_id,program_id,ip,name,start_date,end_date,created_at,program,url,subpath,counter,lesson,datetime
2019-08-24,16:18:15,4-python/1-overview,423,32.0,2,192.171.117.210,Betelgeuse,2019-05-28,2019-10-08,2019-05-28 18:41:05,web_dev,4-python,"[4-python, 1-overview]",1,pending,2019-08-24
2019-09-10,08:38:54,4-python/1-overview,420,32.0,2,97.105.19.58,Betelgeuse,2019-05-28,2019-10-08,2019-05-28 18:41:05,web_dev,4-python,"[4-python, 1-overview]",1,pending,2019-09-10
2019-09-14,11:38:21,4-python/1-overview,420,32.0,2,97.105.19.58,Betelgeuse,2019-05-28,2019-10-08,2019-05-28 18:41:05,web_dev,4-python,"[4-python, 1-overview]",1,pending,2019-09-14
2019-09-14,11:38:23,4-python/5-functions,420,32.0,2,97.105.19.58,Betelgeuse,2019-05-28,2019-10-08,2019-05-28 18:41:05,web_dev,4-python,"[4-python, 5-functions]",1,pending,2019-09-14
2019-09-14,11:38:30,4-python/6-imports,420,32.0,2,97.105.19.58,Betelgeuse,2019-05-28,2019-10-08,2019-05-28 18:41:05,web_dev,4-python,"[4-python, 6-imports]",1,pending,2019-09-14
2019-09-14,11:38:36,4-python/project,420,32.0,2,97.105.19.58,Betelgeuse,2019-05-28,2019-10-08,2019-05-28 18:41:05,web_dev,4-python,"[4-python, project]",1,pending,2019-09-14
2019-09-14,11:38:39,4-python/3-data-types-and-variables,420,32.0,2,97.105.19.58,Betelgeuse,2019-05-28,2019-10-08,2019-05-28 18:41:05,web_dev,4-python,"[4-python, 3-data-types-and-variables]",1,pending,2019-09-14
2019-09-14,11:39:26,4-python/1-overview,420,32.0,2,97.105.19.58,Betelgeuse,2019-05-28,2019-10-08,2019-05-28 18:41:05,web_dev,4-python,"[4-python, 1-overview]",1,pending,2019-09-14
2019-09-14,11:39:30,4-python/2-introduction-to-python,420,32.0,2,97.105.19.58,Betelgeuse,2019-05-28,2019-10-08,2019-05-28 18:41:05,web_dev,4-python,"[4-python, 2-introduction-to-python]",1,pending,2019-09-14
2019-09-14,12:49:02,4-python/1-overview,420,32.0,2,97.105.19.58,Betelgeuse,2019-05-28,2019-10-08,2019-05-28 18:41:05,web_dev,4-python,"[4-python, 1-overview]",1,pending,2019-09-14


In [42]:
### we can see that the last hit on DS material from a web dev student was user 18 on 12-22
### the timestamps suggest these may have been quick look-ups (mostly the fundamentals pages)
### also note that user 458 looks like they were potentially scraping on 11-12
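
One possible way to turn the scraping hunch into a check is to count hits per user per minute and flag unusually dense minutes. This is a sketch under assumptions: it relies on the `date`, `time`, and `user_id` columns shown in `df.head()`, and `flag_bursts` with its threshold is a hypothetical helper, not a tuned detector.

```python
import pandas as pd


def flag_bursts(df, max_hits_per_minute=30):
    """Return (user_id, minute, hits) rows where a user exceeds the per-minute limit."""
    stamps = pd.to_datetime(df["date"].astype(str) + " " + df["time"].astype(str))
    per_minute = (
        df.assign(minute=stamps.dt.floor("min"))
        .groupby(["user_id", "minute"])
        .size()
        .rename("hits")
        .reset_index()
    )
    return per_minute[per_minute["hits"] > max_hits_per_minute]


# Hypothetical sample rows, for illustration only.
sample = pd.DataFrame(
    {
        "user_id": [1] * 5 + [2] * 2,
        "date": ["2019-01-01"] * 7,
        "time": [
            "10:00:01", "10:00:05", "10:00:09", "10:00:20", "10:00:59",
            "10:00:03", "10:05:00",
        ],
    }
)
flagged = flag_bursts(sample, max_hits_per_minute=3)
```

The right threshold depends on how many assets a single page load requests, so it would need to be calibrated against normal student traffic before drawing conclusions.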

In [43]:
#looking at DS to WebDev
ds_2019.url.value_counts().head(25)

# html-css, javascript-i, java-i (let's look at these)

1-fundamentals                      2558
4-python                            2181
6-regression                        2106
3-sql                               1910
5-stats                             1678
7-classification                    1556
                                    1324
8-clustering                        1112
appendix                            1105
11-nlp                               951
9-timeseries                         946
10-anomaly-detection                 910
2-storytelling                       757
12-distributed-ml                    429
search                               337
13-advanced-topics                    64
html-css                              26
3-vocabulary.md                       12
javascript-i                          12
toc                                    9
3-discrete-probabilistic-methods       9
java-i                                 8
spring                                 7
login                                  5
java-iii        

In [44]:
ds_2019_url = ds_2019[ds_2019.url == 'html-css']
ds_2019_url.tail(20)

Unnamed: 0,date,time,path,user_id,cohort_id,program_id,ip,name,start_date,end_date,created_at,program,url,subpath,counter,lesson,url2,datetime
7501,2019-09-27,06:49:23,html-css,476,34.0,3,136.50.49.145,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,html-css,[html-css],1,pending,pending,2019-09-27
7502,2019-09-27,06:49:32,html-css/introduction,476,34.0,3,136.50.49.145,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,html-css,"[html-css, introduction]",1,pending,pending,2019-09-27
7503,2019-09-27,06:49:42,html-css/elements,476,34.0,3,136.50.49.145,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,html-css,"[html-css, elements]",1,pending,pending,2019-09-27
16400,2019-11-25,14:26:13,html-css,476,34.0,3,97.105.19.58,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,html-css,[html-css],1,pending,pending,2019-11-25
16410,2019-11-25,14:26:28,html-css,476,34.0,3,97.105.19.58,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,html-css,[html-css],1,pending,pending,2019-11-25
16512,2019-11-26,10:37:29,html-css,476,34.0,3,97.105.19.58,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,html-css,[html-css],1,pending,pending,2019-11-26
16528,2019-11-26,10:54:59,html-css,476,34.0,3,97.105.19.58,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,html-css,[html-css],1,pending,pending,2019-11-26
16571,2019-11-26,15:53:24,html-css,467,34.0,3,97.105.19.58,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,html-css,[html-css],1,pending,pending,2019-11-26
16583,2019-11-26,16:42:20,html-css,472,34.0,3,97.105.19.58,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,html-css,[html-css],1,pending,pending,2019-11-26
16584,2019-11-26,16:42:25,html-css/introduction,472,34.0,3,97.105.19.58,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,html-css,"[html-css, introduction]",1,pending,pending,2019-11-26


In [45]:
ds_2019_url = ds_2019[ds_2019.url == 'javascript-i']
ds_2019_url.tail(20)

Unnamed: 0,date,time,path,user_id,cohort_id,program_id,ip,name,start_date,end_date,created_at,program,url,subpath,counter,lesson,url2,datetime
7318,2019-09-25,19:30:44,javascript-i,476,34.0,3,136.50.49.145,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,javascript-i,[javascript-i],1,pending,pending,2019-09-25
7320,2019-09-25,19:31:07,javascript-i,476,34.0,3,136.50.49.145,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,javascript-i,[javascript-i],1,pending,pending,2019-09-25
7322,2019-09-25,19:31:12,javascript-i,476,34.0,3,136.50.49.145,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,javascript-i,[javascript-i],1,pending,pending,2019-09-25
7329,2019-09-25,19:32:23,javascript-i,476,34.0,3,136.50.49.145,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,javascript-i,[javascript-i],1,pending,pending,2019-09-25
16401,2019-11-25,14:26:14,javascript-i,476,34.0,3,97.105.19.58,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,javascript-i,[javascript-i],1,pending,pending,2019-11-25
16503,2019-11-26,10:31:16,javascript-i/bom-and-dom/dom,467,34.0,3,97.105.19.58,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,javascript-i,"[javascript-i, bom-and-dom, dom]",1,pending,pending,2019-11-26
16513,2019-11-26,10:37:35,javascript-i,476,34.0,3,97.105.19.58,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,javascript-i,[javascript-i],1,pending,pending,2019-11-26
16529,2019-11-26,10:55:05,javascript-i,476,34.0,3,97.105.19.58,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,javascript-i,[javascript-i],1,pending,pending,2019-11-26
16569,2019-11-26,15:32:05,javascript-i/conditionals,472,34.0,3,97.105.19.58,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,javascript-i,"[javascript-i, conditionals]",1,pending,pending,2019-11-26
16878,2019-12-03,10:04:40,javascript-i,467,34.0,3,97.105.19.58,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,javascript-i,[javascript-i],1,pending,pending,2019-12-03


In [46]:
ds_2019_url = ds_2019[ds_2019.url == 'java-i']
ds_2019_url.tail(20)

Unnamed: 0,date,time,path,user_id,cohort_id,program_id,ip,name,start_date,end_date,created_at,program,url,subpath,counter,lesson,url2,datetime
7321,2019-09-25,19:31:11,java-i,476,34.0,3,136.50.49.145,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,java-i,[java-i],1,pending,pending,2019-09-25
7323,2019-09-25,19:31:14,java-i,476,34.0,3,136.50.49.145,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,java-i,[java-i],1,pending,pending,2019-09-25
7330,2019-09-25,19:32:34,java-i,476,34.0,3,136.50.49.145,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,java-i,[java-i],1,pending,pending,2019-09-25
7331,2019-09-25,19:32:38,java-i,476,34.0,3,136.50.49.145,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,java-i,[java-i],1,pending,pending,2019-09-25
7332,2019-09-25,19:32:44,java-i/console-io,476,34.0,3,136.50.49.145,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,java-i,"[java-i, console-io]",1,pending,pending,2019-09-25
16404,2019-11-25,14:26:17,java-i,476,34.0,3,97.105.19.58,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,java-i,[java-i],1,pending,pending,2019-11-25
16515,2019-11-26,10:38:06,java-i,476,34.0,3,97.105.19.58,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,java-i,[java-i],1,pending,pending,2019-11-26
16881,2019-12-03,10:04:48,java-i,467,34.0,3,97.105.19.58,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,java-i,[java-i],1,pending,pending,2019-12-03


In [47]:
### last access of a WD student accessing DS was 12-22-19

### answer:
- The last time a WD student accessed DS material was 12-22-19
- The last time a DS student accessed WD material was 12-14-19

There is no evidence of a cutoff before the last two weeks of 2019; cross-program access continued until then. A few students' access patterns also suggest they may have been scraping the material.
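
The last cross-program access dates could also be computed directly instead of read off the `tail()` output. The sketch below assumes the `program`, `url`, and `date` columns from `df`; `WD_URLS` and `DS_URLS` are illustrative subsets taken from the url listings above (shared topics such as `mysql` are left out on purpose), and `last_cross_access` and the synthetic rows are hypothetical.

```python
import pandas as pd

# Assumption: top-level url values that belong to only one curriculum.
WD_URLS = {"html-css", "javascript-i", "java-i", "java-iii", "spring"}
DS_URLS = {"1-fundamentals", "4-python", "6-regression"}


def last_cross_access(df):
    """Return, per program, the latest date it hit the other program's urls."""
    ds_on_wd = (df["program"] == "data_science") & df["url"].isin(WD_URLS)
    wd_on_ds = (df["program"] == "web_dev") & df["url"].isin(DS_URLS)
    cross = df[ds_on_wd | wd_on_ds]
    return cross.groupby("program")["date"].max()


# Synthetic rows, for illustration only (not taken from the logs).
sample = pd.DataFrame(
    {
        "program": ["data_science", "data_science", "data_science", "web_dev", "web_dev"],
        "url": ["html-css", "java-i", "4-python", "1-fundamentals", "mysql"],
        "date": pd.to_datetime(
            ["2019-01-05", "2019-01-09", "2019-01-20", "2019-01-07", "2019-01-30"]
        ),
    }
)
last_seen = last_cross_access(sample)
```

Running the same function on all years, not only 2019, would also help with the "did it happen before?" part of the question.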

In [48]:
########

In [49]:
# I have been wondering what technically '/' stands for - open search?
df.path.value_counts()

/                             39514
toc                           16680
javascript-i                  16386
search/search_index.json      16185
html-css                      11843
                              ...  
About_NLP                         1
8.0_Intro_Module                  1
introduction-to-matplotlib        1
2.0_Intro_Stats                   1
13.5_Tableau                      1
Name: path, Length: 1844, dtype: int64

### Follow-up questions

- 1. Is there any suspicious activity, such as users/machines/etc accessing the curriculum who shouldn’t be? Does it appear that any web-scraping is happening? Are there any suspicious IP addresses?
- 2. Are there students who, when active, hardly access the curriculum? If so, what information do you have about these students?

In [50]:
### hit count under 200? for students

df.head()

Unnamed: 0,date,time,path,user_id,cohort_id,program_id,ip,name,start_date,end_date,created_at,program,url,subpath,counter
0,2018-01-26,09:55:03,/,1,8.0,1,97.105.19.61,Hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,web_dev,,"[, ]",1
1,2018-01-26,09:56:02,java-ii,1,8.0,1,97.105.19.61,Hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,web_dev,java-ii,[java-ii],1
2,2018-01-26,09:56:05,java-ii/object-oriented-programming,1,8.0,1,97.105.19.61,Hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,web_dev,java-ii,"[java-ii, object-oriented-programming]",1
3,2018-01-26,09:56:06,slides/object_oriented_programming,1,8.0,1,97.105.19.61,Hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,web_dev,slides,"[slides, object_oriented_programming]",1
4,2018-01-26,09:56:24,javascript-i/conditionals,2,22.0,2,97.105.19.61,Teddy,2018-01-08,2018-05-17,2018-01-08 13:59:10,web_dev,javascript-i,"[javascript-i, conditionals]",1


In [51]:
#groupby userid? value_counts?

In [52]:
# Add a constant counter column so hits can be summed per user_id

df['counter'] = 1

In [53]:
df.tail()

Unnamed: 0,date,time,path,user_id,cohort_id,program_id,ip,name,start_date,end_date,created_at,program,url,subpath,counter
847317,2021-04-21,16:36:09,jquery/personal-site,869,135.0,2,136.50.98.51,Marco,2021-01-25,2021-07-19,2021-01-20 21:31:11,web_dev,jquery,"[jquery, personal-site]",1
847318,2021-04-21,16:36:34,html-css/css-ii/bootstrap-grid-system,948,138.0,2,104.48.214.211,Neptune,2021-03-15,2021-09-03,2021-03-15 19:57:09,web_dev,html-css,"[html-css, css-ii, bootstrap-grid-system]",1
847319,2021-04-21,16:37:48,java-iii,834,134.0,2,67.11.50.23,Luna,2020-12-07,2021-06-08,2020-12-07 16:58:43,web_dev,java-iii,[java-iii],1
847320,2021-04-21,16:38:14,java-iii/servlets,834,134.0,2,67.11.50.23,Luna,2020-12-07,2021-06-08,2020-12-07 16:58:43,web_dev,java-iii,"[java-iii, servlets]",1
847324,2021-04-21,16:41:51,javascript-i/bom-and-dom/dom,875,135.0,2,24.242.150.231,Marco,2021-01-25,2021-07-19,2021-01-20 21:31:11,web_dev,javascript-i,"[javascript-i, bom-and-dom, dom]",1


In [54]:
# Sum the counter column per user_id to get each student's total number of hits
student_counter = df.groupby('user_id').counter.sum().sort_values()
# Convert the resulting Series to a DataFrame
student_counter = pd.DataFrame(student_counter)

student_counter

Unnamed: 0_level_0,counter
user_id,Unnamed: 1_level_1
649,1
918,1
593,1
952,1
940,1
...,...
423,3804
570,4584
344,5460
495,6451
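The same per-user hit count can also be obtained without a helper `counter` column, since `groupby(...).size()` counts rows directly. This is a minimal sketch on a toy frame (the values are made up), not a replacement for the cell above.

```python
import pandas as pd

# Toy access log (hypothetical values): one row per request
toy = pd.DataFrame({"user_id": [3, 1, 1, 2, 1, 2]})

# Count rows per user, sort ascending, and keep the result as a DataFrame
hits = (
    toy.groupby("user_id")
    .size()
    .sort_values()
    .to_frame(name="counter")
)
print(hits)
```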


In [55]:
student_counter[(student_counter <= 50)].count()
# 111 students have 50 or fewer hits

counter    111
dtype: int64

In [56]:
student_counter[(student_counter > 50)].count()
# 775 students have more than 50 hits

counter    775
dtype: int64

In [57]:
# 10th percentile of hits per student (cutoff for the lowest 10%)
bottom_ten = student_counter.counter.quantile(.10)

bottom_ten # the cutoff is 34 hits (a hit count, not a number of students)

34.0

In [58]:
# Keep only students below the 10th-percentile hit count (the least-active users)
student_counter = student_counter[student_counter.counter < bottom_ten].reset_index()


In [59]:
student_counter

Unnamed: 0,user_id,counter
0,649,1
1,918,1
2,593,1
3,952,1
4,940,1
...,...,...
83,463,30
84,112,32
85,740,32
86,778,33
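The percentile cutoff and filter can be sketched on toy numbers to show what `quantile(.10)` returns and which rows a strict `<` comparison keeps. The values below are made up, and the result is stored under a new name (`low_activity`, an illustrative choice) so the full table is not overwritten.

```python
import pandas as pd

# Toy per-student hit totals (hypothetical values)
toy_counter = pd.DataFrame(
    {"counter": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]},
    index=pd.Index(range(1, 11), name="user_id"),
)

# 10th percentile of hits, using pandas' default linear interpolation
threshold = toy_counter["counter"].quantile(0.10)

# Students strictly below the cutoff, kept in a separate frame
low_activity = toy_counter[toy_counter["counter"] < threshold].reset_index()
print(threshold)
print(low_activity)
```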


### I can attempt to answer this further, but one issue is that 5 cohort end dates fall after the last date in this dataframe, so students in those cohorts have had less time to accumulate hits and their totals are not directly comparable.
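One possible way to reduce that imbalance is to compare hits per observed day rather than raw totals, capping each end date at the last date in the log. The sketch below uses two made-up cohorts; the cohort names, hit totals, and the `observed_days` / `hits_per_day` columns are assumptions for illustration only.

```python
import pandas as pd

# Last access date in the log (the value df.date.max() returns in this notebook)
last_log_date = pd.Timestamp("2021-04-21")

# Hypothetical cohorts: one finished, one still in progress when the log ends
cohorts = pd.DataFrame({
    "name": ["Done", "Ongoing"],
    "start_date": pd.to_datetime(["2020-01-01", "2021-04-12"]),
    "end_date": pd.to_datetime(["2020-01-11", "2021-10-01"]),
    "hits": [100, 45],
})

# Cap end dates that fall after the end of the log
observed_end = cohorts["end_date"].where(
    cohorts["end_date"] <= last_log_date, last_log_date
)

# Days each cohort was actually observable, and hits normalized by that span
cohorts["observed_days"] = (observed_end - cohorts["start_date"]).dt.days
cohorts["hits_per_day"] = cohorts["hits"] / cohorts["observed_days"]
print(cohorts[["name", "observed_days", "hits_per_day"]])
```

This still ignores that activity is not uniform across a program, so it is a rough correction at best.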

In [60]:
student_counter.user_id

0     649
1     918
2     593
3     952
4     940
     ... 
83    463
84    112
85    740
86    778
87    611
Name: user_id, Length: 88, dtype: int64

In [61]:
# #going to concat to the main dataframe
# student_count = pd.concat([student_counter, df])
# student_count.isnull().sum()

# student_count.counter.sort_values().tail(20)
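Instead of `pd.concat`, which stacks the two frames row-wise and leaves NaNs in the unmatched columns, the low-activity users could be pulled from the main frame with `isin`, or per-user totals attached with a left `merge`. This is a sketch on toy data; `low_ids` stands in for `student_counter.user_id`, and all values are made up.

```python
import pandas as pd

# Toy access log (hypothetical values)
logs = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "path": ["/", "toc", "/", "java-i", "java-ii", "mysql"],
})

# Stand-in for student_counter.user_id (ids of the least-active students)
low_ids = pd.Series([2], name="user_id")

# Option 1: keep only the log rows that belong to low-activity users
low_activity_logs = logs[logs["user_id"].isin(low_ids)]

# Option 2: attach each user's total hit count to every log row
totals = logs.groupby("user_id").size().reset_index(name="counter")
with_totals = logs.merge(totals, on="user_id", how="left")

print(low_activity_logs)
print(with_totals)
```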

In [62]:
######

In [63]:
df.name.value_counts()

Ceres         40730
Zion          38096
Jupiter       37109
Fortuna       36902
Voyageurs     35636
Ganymede      33844
Apex          33568
Deimos        32888
Darden        32015
Teddy         30926
Hyperion      29855
Betelgeuse    29356
Ulysses       28534
Europa        28033
Xanadu        27749
Bayes         26538
Wrangell      25586
Andromeda     25359
Kalypso       23691
Curie         21581
Yosemite      20743
Bash          17713
Luna          16623
Marco         16397
Easley        14715
Lassen         9587
Arches         8890
Florence       8562
Sequoia        7444
Neptune        7276
Olympic        4954
Kings          2845
Pinnacles      2158
Hampton        1712
Oberon         1672
Quincy         1237
Niagara         755
Mammoth         691
Glacier         598
Joshua          302
Ike             253
Badlands         93
Franklin         72
Apollo            5
Denali            4
Everglades        1
Name: name, dtype: int64

In [64]:

df[['start_date', 'name']].value_counts()

start_date  name      
2019-07-15  Ceres         40730
2019-01-22  Zion          38096
2020-09-21  Jupiter       37109
2020-01-13  Fortuna       36902
2018-05-29  Voyageurs     35636
2020-03-23  Ganymede      33844
2020-02-24  Apex          33568
2019-09-16  Deimos        32888
2020-07-13  Darden        32015
2018-01-08  Teddy         30926
2020-05-26  Hyperion      29855
2019-05-28  Betelgeuse    29356
2018-03-05  Ulysses       28534
2019-11-04  Europa        28033
2018-09-17  Xanadu        27749
2019-08-19  Bayes         26538
2018-07-23  Wrangell      25586
2019-03-18  Andromeda     25359
2020-11-02  Kalypso       23691
2020-02-03  Curie         21581
2018-11-05  Yosemite      20743
2020-07-20  Bash          17713
2020-12-07  Luna          16623
2021-01-25  Marco         16397
2020-12-07  Easley        14715
2016-07-18  Lassen         9587
2014-02-04  Arches         8890
2021-03-15  Florence       8562
2017-09-27  Sequoia        7444
2021-03-15  Neptune        7276
2017-02-06  Olymp

In [65]:
df_ds[['start_date', 'name']].value_counts()

start_date  name    
2020-07-13  Darden      32015
2019-08-19  Bayes       26538
2020-02-03  Curie       21581
2020-12-07  Easley      14715
2021-03-15  Florence     8562
dtype: int64

In [66]:
df_wd[['start_date', 'name']].value_counts()

start_date  name      
2019-07-15  Ceres         40730
2019-01-22  Zion          38096
2020-09-21  Jupiter       37109
2020-01-13  Fortuna       36902
2018-05-29  Voyageurs     35636
2020-03-23  Ganymede      33844
2020-02-24  Apex          33568
2019-09-16  Deimos        32888
2018-01-08  Teddy         30926
2020-05-26  Hyperion      29855
2019-05-28  Betelgeuse    29356
2018-03-05  Ulysses       28534
2019-11-04  Europa        28033
2018-09-17  Xanadu        27749
2018-07-23  Wrangell      25586
2019-03-18  Andromeda     25359
2020-11-02  Kalypso       23691
2018-11-05  Yosemite      20743
2020-07-20  Bash          17713
2020-12-07  Luna          16623
2021-01-25  Marco         16397
2016-07-18  Lassen         9587
2014-02-04  Arches         8890
2017-09-27  Sequoia        7444
2021-03-15  Neptune        7276
2017-02-06  Olympic        4954
2016-05-23  Kings          2845
2017-03-27  Pinnacles      2158
2015-09-22  Hampton        1712
2021-04-12  Oberon         1672
2017-06-05  Quinc

In [67]:
df.start_date.max(), df.end_date.max()

('2021-04-12', '2021-10-01')

In [68]:
df = df.reset_index()

In [69]:
### IMPORTANT TO NOTE: this is the last access date recorded in this dataframe

df.date.max()

'2021-04-21'
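The dates here are stored as strings, so `max()` only works because they are in ISO `YYYY-MM-DD` format. A hedged sketch of converting them to datetimes and listing cohorts whose end date falls after the last logged date is shown below; the toy frame and cohort names `A` and `B` are made up.

```python
import pandas as pd

# Toy log (hypothetical): cohort A has graduated, cohort B ends after the log does
toy = pd.DataFrame({
    "name": ["A", "A", "B"],
    "date": ["2021-04-20", "2021-04-21", "2021-04-21"],
    "end_date": ["2021-03-30", "2021-03-30", "2021-10-01"],
})

# Convert string columns to real datetimes before comparing
toy["date"] = pd.to_datetime(toy["date"])
toy["end_date"] = pd.to_datetime(toy["end_date"])

last_log_date = toy["date"].max()

# Cohorts still in progress when the log ends
in_progress = (
    toy.loc[toy["end_date"] > last_log_date, "name"]
    .drop_duplicates()
    .tolist()
)
print(last_log_date.date(), in_progress)
```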

In [70]:
df.end_date.value_counts().sort_index()

2014-04-22     8890
2014-08-22       93
2015-01-18        4
2015-02-24        1
2015-05-26       72
2015-07-29        5
2015-10-06      598
2016-02-06     1712
2016-05-12      253
2016-06-30      302
2016-09-15     2845
2016-11-10     9587
2017-02-02      691
2017-03-09      755
2017-05-25     4954
2017-07-20     2158
2017-09-22     1237
2018-02-15     7444
2018-05-17    30926
2018-07-19    28534
2018-10-11    35636
2018-11-29    25586
2019-02-08    27749
2019-04-03    20743
2019-06-04    38096
2019-07-30    25359
2019-10-08    29356
2019-12-11    40730
2020-01-30    26538
2020-02-27    32888
2020-04-17    28033
2020-06-23    36902
2020-07-07    21581
2020-07-29    33568
2020-08-20    33844
2020-11-10    29855
2021-01-12    32015
2021-01-21    17713
2021-03-30    37109
2021-05-04    23691
2021-06-08    31338
2021-07-19    16397
2021-09-03    15838
2021-10-01     1672
Name: end_date, dtype: int64

In [71]:
df_wd[['end_date', 'name']].value_counts().sort_index()

end_date    name      
2014-04-22  Arches         8890
2014-08-22  Badlands         93
2015-01-18  Denali            4
2015-02-24  Everglades        1
2015-05-26  Franklin         72
2015-07-29  Apollo            5
2015-10-06  Glacier         598
2016-02-06  Hampton        1712
2016-05-12  Ike             253
2016-06-30  Joshua          302
2016-09-15  Kings          2845
2016-11-10  Lassen         9587
2017-02-02  Mammoth         691
2017-03-09  Niagara         755
2017-05-25  Olympic        4954
2017-07-20  Pinnacles      2158
2017-09-22  Quincy         1237
2018-02-15  Sequoia        7444
2018-05-17  Teddy         30926
2018-07-19  Ulysses       28534
2018-10-11  Voyageurs     35636
2018-11-29  Wrangell      25586
2019-02-08  Xanadu        27749
2019-04-03  Yosemite      20743
2019-06-04  Zion          38096
2019-07-30  Andromeda     25359
2019-10-08  Betelgeuse    29356
2019-12-11  Ceres         40730
2020-02-27  Deimos        32888
2020-04-17  Europa        28033
2020-06-23  Fortu