### MY470 Computer Programming

### Final Assignment, MT 2022

#### \*\*\* Due 12:00 noon on Monday, January 23, 2023 \*\*\*

---
## The homophily and social contagion of cheating

The final assignment asks you to use the computational thinking and programming skills you learned in the course to answer an empirical social science question. You are expected to apply the best practices and theoretical concepts we covered in the course to produce a program that not only returns the correct output but is also legible, modular, and reasonably optimized. The assignment assumes mastery of loops, conditionals, and functions, as well as awareness of issues related to runtime performance.

In the assignment, we will study the homophily and social contagion of cheating in the massively multiplayer online game PlayerUnknown's Battlegrounds (PUBG). Cheating in this context means the adoption of unapproved software that gives the player an unfair advantage in the game (e.g. being able to see through walls).

Our hypotheses are that cheaters tend to associate with other cheaters and that players who interact with cheaters become more likely to adopt cheating themselves. To provide preliminary evidence for these hypotheses, we will:

1. Observe whether cheaters tend to team up with other cheaters more than chance.
2. Observe whether players who observe cheaters are likely to become cheaters more than chance.
3. Observe whether players who are killed by cheaters are likely to become cheaters more than chance.

To test the "more than chance" part, we will simulate alternative universes in which the players played the same game but joined a different team or happened to be killed by someone else at a different time. We will then compare what we observe in the actual data with what we would expect in a "randomized" world.

**NOTE: You are only allowed to use fundamental Python data types (lists, tuples, dictionaries, numpy.ndarray, etc.) to complete this assignment.** You are not allowed to use advanced data querying and data analysis packages such as pandas, sqlite, networkx, or similar. We impose this restriction in order to test your grasp of fundamental programming concepts, not your scripting experience with Python libraries from before or from other courses you may be taking. 

#### Hints

Although this assignment is quite streamlined, imagine that the tasks here are part of a larger project. How would you structure your program if in the future you may need to use a different dataset with similar structure, manipulate the data differently, add additional analyses, or modify the focus of the current analysis?  

Keep different data manipulations in separate functions/methods and group related functions/classes in separate `.py` files. Name your modules in an informative way.

### Data

You will find the data in the repository [https://github.com/lse-my470/assignment-final-data.git](https://github.com/lse-my470/assignment-final-data.git). Please clone the data repository into the same directory where you clone the repository `assignment-final-yourgithubname`. Keep the name `assignment-final-data` for the data folder. Whenever you refer to the data in your code, please use a relative path such as `'../assignment-final-data/filename.txt'` instead of an absolute path such as `'/Users/myname/Documents/my470/assignment-final-data/filename.txt'`. This way, we will be able to test your submission with our own copy of the data without having to modify your code.

The data were collected by Jinny Kim (LSE MSc ASDS '19). The repository contains the following files:

* `cheaters.txt` – contains cheaters who played between March 1 and March 10, 2019
    1. player account id
    2. estimated date when the player started cheating
    3. date when the player's account was banned due to cheating


* `kills.txt` – contains the killings done in 6,000 randomly selected matches played between March 1 and March 10, 2019
    1. match id 
    2. account id of the killer
    3. account id of the player who got killed
    4. time when the kill happened
 
 
* `team_ids.txt` – contains the team ids for players in 5,419 team-play matches in the same period. If a match from the kills.txt file does not appear in these data, we will assume that it was in single-player mode.  
    1. match id 
    2. player account id
    3. team id in match
    
You should not modify the original data in any way. Similarly, you should not duplicate the data in this repository but instead use a relative path to access them.
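As a minimal sketch of reading one of these files through a relative path, the helper below returns each line as a tuple of strings. The name `read_rows` is hypothetical, and the tab delimiter is an assumption, since the file format is not specified above; adjust it to match the actual files.

```python
def read_rows(path, delimiter="\t"):
    """Read a delimited text file into a list of tuples of strings.

    Assumption: one record per line, fields separated by `delimiter`
    (tab by default; the real files may use a different separator).
    """
    rows = []
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if line:  # skip blank lines
                rows.append(tuple(line.split(delimiter)))
    return rows


# Hypothetical usage with a relative path, so the original data stay untouched:
# cheaters = read_rows('../assignment-final-data/cheaters.txt')
```

Because the function only reads, the original data are never modified, and the relative path keeps the code portable across machines.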

### Output

The tasks ask you to output actual counts and expected counts (mean with 95% confidence interval). To estimate the 95% confidence intervals, ignore the small sample size and the fact that we are dealing with count data, and simply use the approximation: 95% CI $= \mu \pm 1.96 \frac{\sigma}{\sqrt{n}}$, where $\mu$ is the mean and $\sigma$ the standard deviation of the counts in the $n=20$ randomizations. You are free to use `statsmodels` or `numpy` to calculate these values.
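The approximation above can be sketched with the standard library alone. The function name `mean_with_ci` is hypothetical, and using the sample standard deviation (`statistics.stdev`) is an assumption, since the text does not say whether $\sigma$ is the sample or population value.

```python
import math
import statistics


def mean_with_ci(counts, z=1.96):
    """Return (mean, lower, upper) using mean +/- z * sigma / sqrt(n).

    Assumption: sigma is the sample standard deviation of the counts.
    """
    mean = statistics.mean(counts)
    sigma = statistics.stdev(counts)
    half_width = z * sigma / math.sqrt(len(counts))
    return mean, mean - half_width, mean + half_width
```

Returning the bounds in a fixed, named order (lower before upper) makes it harder to mislabel them when printing.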


#### Hints

When writing your code, test it on a small "toy dataset", instead of the entire data. This way, you won't need to wait for minutes/hours just to find out that you have a syntax error!

If the randomization is time consuming, it may be worth finding a way to save the data you generate on hard disk so that you don't need to run the randomization again and again. If you decide to do so, please write your code to save any such files with processed data in the directory where this file resides. This way, we can run your code without having to alter it.

If you need to save any new data, think carefully about the most efficient way, both in terms of time and space, to save them.
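One possible way to avoid rerunning a slow randomization is a small cache on disk, sketched below. The helper `load_or_compute` and the file name in the usage comment are hypothetical, and `pickle` is only one option; a compact text format may be preferable depending on the data.

```python
import os
import pickle


def load_or_compute(cache_path, compute):
    """Return the cached result at cache_path, or compute and cache it.

    `compute` is any zero-argument function that produces the result.
    """
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f, protocol=pickle.HIGHEST_PROTOCOL)
    return result


# Hypothetical usage, saving next to the notebook rather than in the data folder:
# counts = load_or_compute('randomized_counts.pkl', run_randomizations)
```

Saving only the summary counts, rather than every shuffled dataset, keeps the cached file small.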

## Import and run your code here

Keep your code in separate `.py` files and then import it in the code cell below. In the subsequent cells, call the functions/methods you need to conduct the requested analyses. We should be able to run all cells here to recalculate the results and get the requested output, without having to modify your code in any way.

In [1]:
# These are my own modules.
import load
import wrangle
import calculate

# These are external modules.
import numpy 
import statsmodels
import statistics
import math
import random
from datetime import datetime
import collections

### 1. Do cheaters team up?

Use the files `cheaters.txt` and `team_ids.txt` to estimate how often cheaters (regardless of when exactly they started cheating) end up on the same team. Your output should say how many teams have 0, 1, 2, 3, or 4 cheaters.

Now, randomly shuffle the team ids among the players in a match. Repeat this 20 times and estimate the expected counts as before. Output the mean and the 95% confidence intervals for the expected counts. 

*Optional: Conclude in a short comment what you observe. This reflection is optional and will not be marked.*
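The shuffling and counting described in this task can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the cells below: rows are assumed to be `(match_id, player_id, team_id)` tuples, `cheaters` a set of account ids, and both function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def shuffle_team_ids(rows, rng=random):
    """Shuffle team ids among the players of each match.

    Team sizes within a match are preserved; only who is on which team changes.
    """
    by_match = defaultdict(list)
    for match_id, player_id, team_id in rows:
        by_match[match_id].append((player_id, team_id))

    shuffled = []
    for match_id, members in by_match.items():
        team_ids = [team for _, team in members]
        rng.shuffle(team_ids)
        for (player_id, _), team_id in zip(members, team_ids):
            shuffled.append((match_id, player_id, team_id))
    return shuffled


def count_teams_by_cheaters(rows, cheaters):
    """Map number of cheaters on a team -> number of teams with that many."""
    per_team = Counter()
    for match_id, player_id, team_id in rows:
        # Adding a bool adds 0 or 1, so teams without cheaters are kept with 0.
        per_team[(match_id, team_id)] += player_id in cheaters
    return Counter(per_team.values())
```

Calling `count_teams_by_cheaters(shuffle_team_ids(rows), cheaters)` 20 times would give the counts from which the mean and confidence interval are computed.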


In [2]:
# Calculating observed number of teams per number of cheaters

# Load cheaters data
cheaters = load.load_first_line('../assignment-final-data/cheaters.txt')

# Load teams data
teams = load.load_all_sorted('../assignment-final-data/team_ids.txt')

# Add dummy for whether a player is a cheater
wrangle.add_dummy_end(teams, cheaters)

# Calculate number of cheaters per team by counting dummy variable
match_team_cheaters_dic = calculate.calc_dummy_count_per_key(teams)

# Count number of teams with 0, 1, 2, 3, 4 cheaters
actual_cheaters_list = calculate.get_no_cheaters(match_team_cheaters_dic, 5)

print("There are", actual_cheaters_list[0],"observed instances of 0 cheaters in a team.")
print("There are", actual_cheaters_list[1],"observed instances of 1 cheater in a team.")
print("There are", actual_cheaters_list[2],"observed instances of 2 cheaters in a team.")
print("There are", actual_cheaters_list[3],"observed instances of 3 cheaters in a team.")
print("There are", actual_cheaters_list[4],"observed instances of 4 cheaters in a team.")
print(actual_cheaters_list)

There are 170782 observed instances of 0 cheaters in a team.
There are 3199 observed instances of 1 cheater in a team.
There are 182 observed instances of 2 cheaters in a team.
There are 9 observed instances of 3 cheaters in a team.
There are 2 observed instances of 4 cheaters in a team.
[170782, 3199, 182, 9, 2]


In [3]:
# Now for the randomisation and testing of the null hypothesis.
# I have chosen to run a for loop 20 times and keep the functions visible to this level of modularity for legibility and understanding of logic. 
# I could have wrapped this all up in one function named, for example, 'part_1',
# but I feel like that hides the logic and makes it harder to understand.

# Run 20 times and get the number of teams with 0, 1, 2, 3, 4 cheaters
cheaters_pt = []

for i in range(20):
    cheaters_pt.append(calculate.get_random_teams_per_cheaters(teams))

# Calculate the number of teams with 0, 1, 2, 3, 4 cheaters for each randomisation
cheaters = wrangle.get_cheaters_per_match_team(cheaters_pt)

# Get stats for each number of cheaters
zero_cheaters_stats = calculate.get_stats_list(cheaters[0])
one_cheater_stats = calculate.get_stats_list(cheaters[1])
two_cheaters_stats = calculate.get_stats_list(cheaters[2])
three_cheaters_stats = calculate.get_stats_list(cheaters[3])
four_cheaters_stats = calculate.get_stats_list(cheaters[4])

# get_stats_list returns the upper bound at index 2 and the lower bound at index 3.
print("There is an expected count of", zero_cheaters_stats[0], 
      "instances of 0 cheaters in a team, with a lower 95% confidence interval of", 
      zero_cheaters_stats[3], "and an upper 95% confidence interval of", 
      zero_cheaters_stats[2],".")

print("There is an expected count of", one_cheater_stats[0], 
      "instances of 1 cheater in a team, with a lower 95% confidence interval of", 
      one_cheater_stats[3], "and an upper 95% confidence interval of", 
      one_cheater_stats[2],".")

print("There is an expected count of", two_cheaters_stats[0], 
      "instances of 2 cheaters in a team, with a lower 95% confidence interval of", 
      two_cheaters_stats[3], "and an upper 95% confidence interval of", 
      two_cheaters_stats[2],".")

print("There is an expected count of", three_cheaters_stats[0], 
      "instances of 3 cheaters in a team, with a lower 95% confidence interval of", 
      three_cheaters_stats[3], "and an upper 95% confidence interval of", 
      three_cheaters_stats[2],".")

print("There is an expected count of", four_cheaters_stats[0], 
      "instances of 4 cheaters in a team, with a lower 95% confidence interval of", 
      four_cheaters_stats[3], "and an upper 95% confidence interval of", 
      four_cheaters_stats[2],".")

There is an expected count of 170610.85 instances of 0 cheaters in a team, with a lower 95% confidence interval of 170608.67823229893 and an upper 95% confidence interval of 170613.02176770108 .
There is an expected count of 3528.6 instances of 1 cheater in a team, with a lower 95% confidence interval of 3524.3285157023274 and an upper 95% confidence interval of 3532.8714842976724 .
There is an expected count of 34.25 instances of 2 cheaters in a team, with a lower 95% confidence interval of 32.18024410287488 and an upper 95% confidence interval of 36.31975589712512 .
There is an expected count of 0.3 instances of 3 cheaters in a team, with a lower 95% confidence interval of 0.04964278152586604 and an upper 95% confidence interval of 0.5503572184741339 .
There is an expected count of 0 instances of 4 cheaters in a team, with a lower 95% confidence interval of 0.0 and an upper 95% confidence interval of 0.0 .


__Part 1: Conclusions__

The purpose of this analysis is to test whether cheaters tend to team up. The 20 randomisations of team ids provide the expected counts under the null hypothesis that cheaters do not tend to team up with each other. We can reject the null hypothesis only if the observed numbers of teams with 2, 3, and 4 cheaters exceed the upper bound of the 95% confidence interval of the expected counts.

We can conclude that __yes, cheaters do tend to team up__, as we observe more teams with 2, 3, and 4 cheaters than expected under the null hypothesis at a significance level of 0.05. For example, we expect roughly 32 to 36 teams with 2 cheaters (each run of the randomisation differs slightly, but these are typical figures), but we observe 182. We expect fewer than 1 team with 3 or 4 cheaters, but observe 9 and 2 teams respectively.

### 2. Do victims of cheating start cheating?

Use the files `cheaters.txt` and `kills.txt` to count how many players got killed by an active cheater on at least one occasion and then started cheating. Specifically, we are interested in situations where:

1. Player B has started cheating but player A is not cheating.
2. Player B kills player A.
3. At some point afterwards, player A starts cheating.

Output the count in the data. 
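The three conditions above can be sketched as a single pass over the kills. This is an illustration, not the implementation used in the cells below: kills are assumed to be `(match_id, killer, victim, kill_date)` tuples with `datetime.date` values, `cheat_start` is assumed to map each cheater's account id to the date they started cheating, and the date of the kill is used as a stand-in for the match date. Both function names are hypothetical.

```python
from datetime import date


def is_cheating(player, on_date, cheat_start):
    """True if the player had started cheating on or before on_date."""
    start = cheat_start.get(player)
    return start is not None and start <= on_date


def victims_who_later_cheat(kills, cheat_start):
    """Return the set of players killed by an active cheater who started cheating later."""
    victims = set()
    for _, killer, victim, kill_date in kills:
        victim_start = cheat_start.get(victim)
        if (is_cheating(killer, kill_date, cheat_start)
                and victim_start is not None
                and victim_start > kill_date):
            victims.add(victim)
    return victims
```

Collecting victims in a set counts each player once, even if they were killed by cheaters on several occasions.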

Then, simulate alternative worlds in which everything is the same but the events took place in a somewhat different sequence. To do so, randomize within a game, keeping the timing and structure of interactions but shuffling the player ids. Generate 20 randomizations like this and estimate the expected count of victims of cheating who start cheating as before. Output the mean and the 95% confidence interval for the expected count in these randomized worlds.
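A minimal sketch of this within-game randomization is given below, assuming kills are `(match_id, killer, victim, time)` tuples held in a list. The function name is hypothetical, and the cells below use their own helpers. Each match gets a random one-to-one relabelling of its players, so the sequence and timing of kills are unchanged.

```python
import random
from collections import defaultdict


def shuffle_players_within_matches(kills, rng=random):
    """Relabel players within each match with a random permutation of the same ids."""
    # Collect the unique players of each match, preserving first appearance.
    players_by_match = defaultdict(dict)
    for match_id, killer, victim, _ in kills:
        players_by_match[match_id][killer] = None
        players_by_match[match_id][victim] = None

    # Build one random one-to-one mapping per match.
    mapping = {}
    for match_id, players in players_by_match.items():
        originals = list(players)
        replacements = originals[:]
        rng.shuffle(replacements)
        mapping[match_id] = dict(zip(originals, replacements))

    return [(m, mapping[m][killer], mapping[m][victim], t)
            for m, killer, victim, t in kills]
```

Because the mapping is a permutation, a player never ends up in a match they did not play, and no two players collapse into one.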

*Optional: Conclude in a short comment what you observe. This reflection is optional and will not be marked.*

#### Hint

Starting time of cheating is estimated as a date, so assume that a player cheats on any match that they started playing on that date or afterwards. Use the match starting date so that if the match started before midnight of the cheating date but ended after midnight, we will assume that the player was not cheating just yet. 


In [4]:
# 2.1 Calculate the observed number of instances of a victim becoming a cheater after being killed by a cheater.

# Load kills data
kills = load.load_all_sorted_match('../assignment-final-data/kills.txt')

# Load cheaters data 
cheaters = load.load_first_two_lines('../assignment-final-data/cheaters.txt')

# Create dictionary of cheaters and date they started cheating with cheater as key and date as value
cheat_date_dic = wrangle.get_dict(cheaters)

# Add dummy variable to kills data for whether victim becomes a cheater after being killed by a cheater
kill_cheat = wrangle.add_dummies(kills, cheat_date_dic)

# Calculate number of observed instances of victim becoming a cheater after being killed by a cheater.
part_2_observed_count = calculate.count_1(kill_cheat, 6)

print("There are", part_2_observed_count, "instances of a victim becoming a cheater after being killed by a cheater.")

There are 47 instances of a victim becoming a cheater after being killed by a cheater.


In [5]:
# 2.2 Calculate the expected number of instances of a victim becoming a cheater after being killed by a cheater.

# Make Dictionaries of match_id and player_id for cheaters and victims.
# These are constant and thus are kept in a different code block. I will also use these for part 3.
 
match_killer_ids = wrangle.get_dict_mult_values(kill_cheat)
match_victim_ids = wrangle.get_dict_mult_values_2(kill_cheat)
match_ids = match_killer_ids.keys()  

In [6]:
# I have chosen to run a for loop 20 times and keep the functions visible to this level of modularity for legibility and understanding of logic. 
# I could have wrapped this all up in one function named, for example, 'part_2',
# but I feel like that hides the logic and makes it harder to understand.

part_2_random = []

for i in range(20):
    
    # Generate list of all players per match and a randomised duplicate of this list, so that I can swap player fates.
    total_players, total_players_ref = wrangle.get_dict_unique_totals_random(match_ids, match_killer_ids, match_victim_ids)
    
    # Swap fates of players in each match with full shuffle, and remake equivalent with new fates.
    rand_kills = wrangle.swap_by_index_2_elements(kill_cheat, total_players, total_players_ref, match_killer_ids, match_victim_ids)

    # Add dummy variable to randomised kills data to show whether victim becomes a cheater after being killed by a cheater
    rand_kill_cheat = wrangle.add_dummies(rand_kills, cheat_date_dic)
    
    # Count number of instances of victim becoming a cheater after being killed by a cheater in randomised match.
    part_2_random_count = calculate.count_1(rand_kill_cheat, 6)

    # Add counted outcomes to final list
    part_2_random.append(part_2_random_count)
    
part_2_stats = calculate.get_stats_list(part_2_random)

print(part_2_stats)

print("There is an expected count of", part_2_stats[0], 
      "instances of victims becoming cheaters after being killed by a cheater, with a lower 95% confidence interval of", 
      part_2_stats[3], 
      "and an upper 95% confidence interval of", part_2_stats[2]) 

[12.2, 2.930780388260186, 13.484471138353507, 10.915528861646491]
There is an expected count of 12.2 instances of victims becoming cheaters after being killed by a cheater, with a lower 95% confidence interval of 10.915528861646491 and an upper 95% confidence interval of 13.484471138353507


__Part 2: Conclusions__

The purpose of this analysis is to test whether victims of cheating tend to start cheating. The 20 randomisations of player ids within each match provide the expected count under the null hypothesis that victims of cheating do not then go on to cheat themselves. We can reject the null hypothesis at the 0.05 significance level only if the observed count of victims who became cheaters after being killed by a cheater, 47 in this case, falls outside the 95% confidence interval.

47 does fall outside the confidence interval of my randomisation analysis. I ran the 20 randomisations 10 times to check this: the upper bound generally stays below 15, and the expected number of victims of cheating who become cheaters under the null hypothesis is around 11 (12.2 in the run shown above). Thus, we can reject the null hypothesis and conclude that yes, according to our analysis, victims of cheating do tend to go on to cheat themselves.

### 3. Do observers of cheating start cheating?

Use the files `cheaters.txt` and `kills.txt` to count how many players observed an active cheater on at least one occasion and then started cheating. Cheating players can be recognized because they exhibit abnormal killing patterns. We will assume that player A realizes that player B cheats if:

1. Player B has started cheating but player A is not cheating.
2. Player B kills at least 3 other players before player A gets killed in the game.
3. At some point afterwards, player A starts cheating.

Output the count in the data.
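One way to read the three conditions above is sketched below. It is an illustration, not the implementation used in the cells that follow. Kills are assumed to be `(match_id, killer, victim, kill_time)` tuples with `datetime` values, `cheat_start` maps cheaters to the `date` they started cheating, and the date of the first kill in a match is used as a stand-in for the match start date. The function name is hypothetical.

```python
from collections import Counter, defaultdict
from datetime import date, datetime


def observers_who_later_cheat(kills, cheat_start, threshold=3):
    """Players killed after an active cheater's third kill who later start cheating."""
    by_match = defaultdict(list)
    for kill in kills:
        by_match[kill[0]].append(kill)

    observers = set()
    for match_kills in by_match.values():
        match_kills.sort(key=lambda k: k[3])
        match_date = match_kills[0][3].date()  # assumption: proxy for match start
        cheater_kills = Counter()
        cheater_visible = False
        for _, killer, victim, _ in match_kills:
            # Check the victim first, so the cheater's third victim is not an observer.
            victim_start = cheat_start.get(victim)
            if (cheater_visible and victim_start is not None
                    and victim_start > match_date):
                observers.add(victim)
            killer_start = cheat_start.get(killer)
            if killer_start is not None and killer_start <= match_date:
                cheater_kills[killer] += 1
                if cheater_kills[killer] >= threshold:
                    cheater_visible = True
    return observers
```

Under this reading, a player counts as an observer whether they were eventually killed by the cheater or by someone else, as long as their death came after the cheater's third kill. Players who are never killed in the match are not considered.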

Then, use the 20 randomizations from Part 2 to estimate the expected count of observers of cheating who start cheating. Output the mean and the 95% confidence interval for the expected count in these randomized worlds.

*Optional: Conclude in a short comment what you observe. This reflection is optional and will not be marked.*

In [7]:
# 3.1 Calculate the observed number of players who became cheaters after observing a cheater who had already killed 3 players in the match.

# Load kills data for part 3, sorted by match and date
kills_3 = load.load_all_sorted_match_date('../assignment-final-data/kills.txt')
        
# Load cheaters data for part 3
cheaters_3 = load.load_first_two_lines('../assignment-final-data/cheaters.txt')

# Create dictionary of cheaters and date they started cheating
cheat_date_dic = wrangle.get_dict(cheaters_3)

# Add dummy variable for whether killer was cheater at the time of the kill
kills_cheat_3 = wrangle.add_dummy(kills_3, cheat_date_dic)

# Create dictionary of all kills per match with match_id as key and list of kills information as value 
all_kills_per_match = wrangle.make_complete_dictionary(kills_cheat_3)

# Extract all kills after the 3rd kill by a killer in each match. This contains ALL info per kill.
kills_after = wrangle.extract_entries_per_key_after_point(all_kills_per_match, 3)

# Extract victim_id and time of death from kills_after
observers = wrangle.extract_2nd_3rd_element_lolol(kills_after)

# Add dummy for whether the observer became a cheater.
observers_2 = wrangle.add_dummy_2(observers, cheat_date_dic)

# Create dictionary that adds observer player_id as key if they became a cheater after observing a cheater kill 3 people.
observers_turned_cheaters_dict = wrangle.create_dict_conditional_value(observers_2, 1)

print("There are", len(observers_turned_cheaters_dict.keys()), "instances of observers becoming cheaters after observing a cheater kill 3 people.")

There are 213 instances of observers becoming cheaters after observing a cheater kill 3 people.


In [8]:
# 3.2 Testing the null hypothesis: calculate the expected number of players who became cheaters after observing a cheater who had already killed 3 players in the match.
# I have chosen to run a for loop 20 times and keep the functions visible to this level of modularity for legibility and understanding of logic. 
# I could have wrapped this all up in one function named, for example, 'part_3',
# but I feel like that hides the logic and makes it harder to understand.


part_3_random = []

for i in range(20):

    # Generate list of all players per match and a randomised duplicate of this list, so that I can swap player fates.
    total_players_3, total_players_rand = wrangle.get_dict_unique_totals_random(match_ids, match_killer_ids, match_victim_ids)
    
    # Swap fates of players in each match with full shuffle, and remake kill_cheat equivalent with new fates.
    rand_kills_3_swapped = wrangle.swap_by_index_2_elements(kills_cheat_3, total_players_3, total_players_rand, match_killer_ids, match_victim_ids)
    
    # Add dummy variable for whether killer is a cheater
    rand_kills_cheat_3_swapped = wrangle.add_dummy(rand_kills_3_swapped, cheat_date_dic)
    
    # Create dictionary of kills per match with all randomised kill information.
    all_kills_per_match_random = wrangle.make_complete_dictionary(rand_kills_cheat_3_swapped)
    
    # Extract kills information after the 3rd kill by a killer in each match. This contains ALL info per kill.
    kills_after_random = wrangle.extract_entries_per_key_after_point(all_kills_per_match_random, 3)
    
    # Extract victim_id and time of death from kills_after_random
    observers_random = wrangle.extract_2nd_3rd_element_lolol(kills_after_random)
    
    # Add dummy for whether observer became cheater
    observers_2_random = wrangle.add_dummy_2(observers_random, cheat_date_dic) 
    
    # Create dictionary that adds observer player_id as key if they became a cheater after observing a cheater kill 3 people.
    observers_turned_cheaters_dict_random = wrangle.create_dict_conditional_value(observers_2_random, 1)
    
    # Get number of observers turned cheaters
    num_random_observers_turned_cheaters = len(observers_turned_cheaters_dict_random.keys())
    
    part_3_random.append(num_random_observers_turned_cheaters)

part_3_stats = calculate.get_stats_list(part_3_random)

print(part_3_stats)

print("There is an expected count of", part_3_stats[0], 
      "instances of victims becoming cheaters after observing a cheater after a cheater has got 3 kills in that game,  with a lower 95% confidence interval of", 
      part_3_stats[3], 
      "and an upper 95% confidence interval of", part_3_stats[2]) 

[47.3, 7.616464160999477, 50.6380625959884, 43.96193740401159]
There is an expected count of 47.3 instances of victims becoming cheaters after observing a cheater after a cheater has got 3 kills in that game,  with a lower 95% confidence interval of 43.96193740401159 and an upper 95% confidence interval of 50.6380625959884


__Part 3: Conclusions__

The purpose of this analysis is to test whether people who observe cheaters tend to cheat afterwards. The 20 randomisations of player ids within each match provide the expected count under the null hypothesis that observers of cheating are not then more likely to cheat. We can reject the null hypothesis at the 0.05 significance level only if the observed count of observers who became cheaters, 213 in this case, falls outside the 95% confidence interval.

213 does fall outside the confidence interval of my randomisation analysis. I ran the 20 randomisations 10 times to check this: the upper bound generally stays at about 50 or below (50.6 in the run shown above), and the expected number of observers of cheating who become cheaters under the null hypothesis is around 47. Thus, we can reject the null hypothesis and conclude that yes, according to our analysis, observers of cheating are likely to become cheaters themselves.

---

### Evaluation

| Aspect         | Mark     | Comment   
|:--------------:|:--------:|:----------------------
| Code runs      |   /20    |              
| Output 1       |   /10    | 
| Output 2       |   /10    | 
| Output 3       |   /10    | 
| Legibility     |   /10    | 
| Modularity     |   /10    | 
| Optimization   |   /30    | 
| **Total**      |**/100**  | 
