In [1]:
# DS Modules
import numpy as np
import pandas as pd

# Visualization modules
import matplotlib.pyplot as plt
import seaborn as sns

import acquire
import ip_tk

### Detecting suspicious activity using IP geolocation.

By looking up the location of each IP address through a public API, we were able to gather additional information about how users were accessing CodeUp's curriculum.

In [3]:
# Use a helper utility to merge ip information into the dataframe
df = ip_tk.wrangle_ip_merged()

In [17]:
# Show off the kind of information collected
df[['user_id','ip','city','regionName','countryName','latitude','longitude']].sample(5, random_state = 8)

datetime,user_id,ip,city,regionName,countryName,latitude,longitude
2020-10-26 11:55:53,769,98.197.56.20,Houston,Texas,United States,29.7912,-95.4182
2019-01-15 11:06:19,267,97.105.19.58,Euless,Texas,United States,32.8548,-97.0819
2018-08-07 13:10:12,40,97.105.19.58,Euless,Texas,United States,32.8548,-97.0819
2021-02-14 12:39:19,788,173.175.108.125,San Antonio,Texas,United States,29.4551,-98.6498
2020-10-17 23:20:16,668,70.94.168.22,San Antonio,Texas,United States,29.4812,-98.3435


Now, because we have both a timestamp and location information associated with each log entry, we are able to see if anyone is moving around the world impossibly fast.
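One way to make "impossibly fast" concrete is to compute the travel speed implied by each user's consecutive log entries. The following is a minimal sketch, not part of `ip_tk`: the function name `implied_speed_kmh` is hypothetical, it assumes `datetime` is an ordinary column rather than the index, and the demo rows are made up for illustration.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0


def implied_speed_kmh(logs: pd.DataFrame) -> pd.Series:
    """Speed (km/h) implied by each user's consecutive log entries.

    Assumes columns: user_id, datetime, latitude, longitude.
    The first entry of each user has no predecessor and yields NaN.
    """
    logs = logs.sort_values(["user_id", "datetime"])
    grouped = logs.groupby("user_id")

    # Previous position of the same user, in radians
    lat1 = np.radians(grouped["latitude"].shift())
    lon1 = np.radians(grouped["longitude"].shift())
    lat2 = np.radians(logs["latitude"])
    lon2 = np.radians(logs["longitude"])

    # Haversine great-circle distance between consecutive positions
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    distance_km = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a.clip(0, 1)))

    # Elapsed time in hours; identical timestamps become NaN to avoid dividing by zero
    elapsed = logs["datetime"] - grouped["datetime"].shift()
    hours = (elapsed.dt.total_seconds() / 3600).replace(0, np.nan)

    return distance_km / hours


# Made-up demo rows: user 1 jumps from San Antonio to Berlin in 30 minutes,
# user 2 stays in one place.
demo = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 2],
        "datetime": pd.to_datetime(
            [
                "2020-01-01 10:00",
                "2020-01-01 10:30",
                "2020-01-01 10:00",
                "2020-01-01 11:00",
            ]
        ),
        "latitude": [29.4551, 52.52, 29.4551, 29.4551],
        "longitude": [-98.6498, 13.405, -98.6498, -98.6498],
    }
)
speeds = implied_speed_kmh(demo)
suspicious = speeds[speeds > 1000]  # faster than a commercial airliner
```

The 1000 km/h cutoff is an arbitrary illustrative threshold. IP geolocation is only approximate, so short hops between nearby cities should not be read as real movement.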

As an example of how we can use this information, we can ask if anyone was in multiple countries within the same hour.

In [18]:
# Get geohopping events by the hour
ip_tk.detect_country_geohop_events(df, window='1H', group_by='user_id')

,when_start,when_end,where,user_id
0,2020-02-15 00:00:00,2020-02-15 01:00:00,"[Mexico, United States]",64
1,2018-07-28 11:00:00,2018-07-28 12:00:00,"[United States, Canada]",128
2,2018-07-29 16:00:00,2018-07-29 17:00:00,"[United States, Canada]",128
3,2019-01-11 09:00:00,2019-01-11 10:00:00,"[Germany, United States]",270
4,2019-01-14 09:00:00,2019-01-14 10:00:00,"[United States, Germany]",270
5,2019-12-08 12:00:00,2019-12-08 13:00:00,"[Germany, Australia]",469
6,2019-12-12 10:00:00,2019-12-12 11:00:00,"[Australia, United States]",469
7,2019-12-16 10:00:00,2019-12-16 11:00:00,"[Canada, United States]",469
8,2020-01-03 21:00:00,2020-01-03 22:00:00,"[Canada, United States]",469
9,2020-04-16 16:00:00,2020-04-16 17:00:00,"[Switzerland, United States]",570


These events are consistent with leaked credentials: the same account was used from two countries within a single hour. It is also possible that some users used VPNs to tunnel their traffic through different countries, but these events should be taken seriously nonetheless.
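For readers without access to `ip_tk`, the idea behind this check can be sketched in plain pandas. This is an illustration under stated assumptions, not the implementation of `ip_tk.detect_country_geohop_events`: the function name `country_hops` is hypothetical, the frame is assumed to have a `DatetimeIndex` named `datetime`, and the demo rows are made up.

```python
import pandas as pd


def country_hops(logs: pd.DataFrame, window: str = "60min") -> pd.DataFrame:
    """Return the (user, time window) pairs that contain more than one country.

    Assumes a DatetimeIndex and the columns user_id and countryName.
    """
    grouped = logs.groupby(["user_id", pd.Grouper(freq=window)])["countryName"]
    summary = grouped.agg(
        n_countries="nunique",
        where=lambda s: sorted(s.unique()),
    )
    # Keep only windows where the same user shows up in several countries
    return summary[summary["n_countries"] > 1].reset_index()


# Made-up demo rows: user 1 appears in two countries within the same hour,
# user 2 stays in one country.
demo_logs = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 2],
        "countryName": ["Germany", "United States", "United States", "United States"],
    },
    index=pd.DatetimeIndex(
        [
            "2019-01-11 09:10",
            "2019-01-11 09:40",
            "2019-01-11 13:10",
            "2019-01-11 13:20",
        ],
        name="datetime",
    ),
)
hops = country_hops(demo_logs)
```

Because the windows are fixed calendar bins, two entries a few minutes apart can fall into different bins and go undetected. A rolling comparison of consecutive entries would close that gap at the cost of more computation.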

A small window, such as a single hour, gives us high-confidence results: it is very unlikely that any of these users could travel between these countries so quickly. However, a small window only catches cases where both locations appear close together in time, so we miss attacks that happen during off hours or weekends, when the legitimate user is not active. To compensate for that, we can ask the same question with a larger window.

In [19]:
# Get geohopping events by the week
ip_tk.detect_country_geohop_events(df, window='1W', group_by='user_id')

,when_start,when_end,where,user_id
0,2018-04-01,2018-04-08,"[United States, Mexico]",3
1,2018-04-08,2018-04-15,"[United States, Mexico]",3
2,2018-04-01,2018-04-08,"[United States, Germany]",12
3,2020-02-02,2020-02-09,"[Canada, United States]",12
4,2018-04-22,2018-04-29,"[United States, Canada, France]",32
5,2018-03-11,2018-03-18,"[United States, Mexico]",64
6,2019-08-04,2019-08-11,"[United States, Mexico]",64
7,2019-12-01,2019-12-08,"[United States, Canada]",64
8,2019-12-22,2019-12-29,"[United States, Mexico]",64
9,2020-01-19,2020-01-26,"[United States, Mexico]",64


#### Takeaways
The one-hour geohop events are strong evidence that some users' credentials have been compromised, although VPN use cannot be ruled out.

### Confirming the access policy change in 2019

The data that we have access to does not indicate whether a user is accessing a lesson that they are supposed to have access to. To answer this question, we will need to improvise.

We will assume that a module belongs to the program whose users access it the most, and use that assumption to tackle this question.
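The assumption above can be sketched as a count-and-argmax over the access log. The column names `module` and `program_id`, the function name `assign_module_owner`, and the demo rows are all hypothetical and used only for illustration; the real log may name these fields differently.

```python
import pandas as pd


def assign_module_owner(logs: pd.DataFrame) -> pd.Series:
    """Map each module to the program whose users access it the most.

    Assumes the columns module and program_id. On a tie, the program
    whose column comes first is chosen.
    """
    # Rows: modules, columns: programs, values: number of log entries
    counts = logs.groupby(["module", "program_id"]).size().unstack(fill_value=0)
    return counts.idxmax(axis=1)


# Made-up demo rows
demo_access = pd.DataFrame(
    {
        "program_id": [1, 1, 1, 2, 2, 2],
        "module": ["html-css", "html-css", "java-i", "java-i", "java-i", "html-css"],
    }
)
owner = assign_module_owner(demo_access)

# Log entries where a user reads a module assigned to another program
outside = demo_access[demo_access["module"].map(owner) != demo_access["program_id"]]
```

Comparing the share of `outside` entries before and after 2019 is one way to check whether an access policy change took effect.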