### MY470 Computer Programming

### Final Assignment, MT 2022

#### \*\*\* Due 12:00 noon on Monday, January 23, 2023 \*\*\*

---
## The homophily and social contagion of cheating

The final assignment asks you to use the computational thinking and programming skills you learned in the course to answer an empirical social science question. You are expected to apply the best practices and theoretical concepts we covered in the course to produce a program that not only returns the correct output but is also legible, modular, and reasonably optimized. The assignment assumes mastery of loops, conditionals, and functions, as well as awareness of issues related to runtime performance.

In the assignment, we will study the homophily and social contagion of cheating in the massive multiplayer online game PlayerUnknown's Battlegrounds (PUBG). Cheating in this context means the adoption of unapproved software that gives the player an unfair advantage in the game (e.g. being able to see through walls). 

Our hypotheses are that cheaters tend to associate with other cheaters and that players who interact with cheaters become more likely to adopt cheating themselves. To provide preliminary evidence for these hypotheses, we will:

1. Observe whether cheaters tend to team up with other cheaters more than chance.
2. Observe whether players who observe cheaters are likely to become cheaters more than chance.
3. Observe whether players who are killed by cheaters are likely to become cheaters more than chance.

To test the "more than chance" part, we will simulate alternative universes in which the players played the same game but joined a different team or happened to be killed by someone else at a different time. We will then compare what we observe in the actual data with what we would expect in a "randomized" world.

**NOTE: You are only allowed to use fundamental Python data types (lists, tuples, dictionaries, numpy.ndarray, etc.) to complete this assignment.** You are not allowed to use advanced data querying and data analysis packages such as pandas, sqlite, networkx, or similar. We impose this restriction in order to test your grasp of fundamental programming concepts, not your scripting experience with Python libraries from before or from other courses you may be taking. 

#### Hints

Although this assignment is quite streamlined, imagine that the tasks here are part of a larger project. How would you structure your program if in the future you may need to use a different dataset with similar structure, manipulate the data differently, add additional analyses, or modify the focus of the current analysis?  

Keep different data manipulations in separate functions/methods and group related functions/classes in separate `.py` files. Name your modules in an informative way.

### Data

You will find the data in the repository [https://github.com/lse-my470/assignment-final-data.git](https://github.com/lse-my470/assignment-final-data.git). Please clone the data repository in the same directory where you clone the repository `assignment-final-yourgithubname`. Keep the name `assignment-final-data` for the data folder. Whenever you refer to the data in your code, please use a relative path such as `'../assignment-final-data/filename.txt'` instead of an absolute path such as `'/Users/myname/Documents/my470/assignment-final-data/filename.txt'`. This way, we will be able to test your submission with our own copy of the data without having to modify your code.

The data were collected by Jinny Kim (LSE MSc ASDS '19). The repository contains the following files:

* `cheaters.txt` – contains cheaters who played between March 1 and March 10, 2019
    1. player account id
    2. estimated date when the player started cheating
    3. date when the player's account was banned due to cheating


* `kills.txt` – contains the killings done in 6,000 randomly selected matches played between March 1 and March 10, 2019
    1. match id 
    2. account id of the killer
    3. account id of the player who got killed
    4. time when the kill happened
 
 
* `team_ids.txt` – contains the team ids for players in 5,419 team-play matches in the same period. If a match from the `kills.txt` file does not appear in these data, we will assume that it was in single-player mode.
    1. match id 
    2. player account id
    3. team id in match
    
You should not modify the original data in any way. Similarly, you should not duplicate the data in this repository but instead use a relative path to access them.
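
As a minimal sketch of reading the files through a relative path, the helper below loads any of the three files into a list of tuples of strings. The names `DATA_DIR`, `parse_rows`, and `read_rows` are hypothetical, and the tab delimiter is an assumption (check the files and adjust `sep` if needed).

```python
import os

# Relative path to the data folder, as requested above
DATA_DIR = os.path.join('..', 'assignment-final-data')


def parse_rows(lines, sep='\t'):
    """Split each non-empty line into a tuple of string fields.

    The tab separator is an assumption about the file format.
    """
    return [tuple(line.rstrip('\r\n').split(sep)) for line in lines if line.strip()]


def read_rows(filename, data_dir=DATA_DIR, sep='\t'):
    """Read one data file without modifying it and return its rows."""
    with open(os.path.join(data_dir, filename), encoding='utf-8') as f:
        return parse_rows(f, sep)
```

Keeping the parsing separate from the file access makes it easy to test the parser on a few hand-written lines before running it on the full data.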

### Output

The tasks ask you to output actual counts and expected counts (mean with 95% confidence interval). To estimate the 95% confidence intervals, ignore the small sample size and the fact that we are dealing with count data, and simply use the approximation: 95% CI $= \mu \pm 1.96 \frac{\sigma}{\sqrt{n}}$, where $\mu$ is the mean and $\sigma$ the standard deviation of the counts in the $n=20$ randomizations. You are free to use `statsmodels` or `numpy` to calculate these values.
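
The approximation above can be sketched with `numpy` as follows. The function name `mean_and_ci` is hypothetical, and the sketch assumes the population standard deviation (numpy's default `ddof=0`); pass `ddof=1` to `std` if you prefer the sample estimate.

```python
import numpy as np


def mean_and_ci(counts, z=1.96):
    """Return the mean and the (lower, upper) bounds of mean +/- z * sd / sqrt(n)."""
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean()
    # numpy's std uses ddof=0 by default (an assumption in this sketch)
    half_width = z * counts.std() / np.sqrt(len(counts))
    return mean, (mean - half_width, mean + half_width)
```

For the tasks below, `counts` would hold the 20 counts obtained from the 20 randomizations.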


#### Hints

When writing your code, test it on a small "toy dataset", instead of the entire data. This way, you won't need to wait for minutes/hours just to find out that you have a syntax error!

If the randomization is time consuming, it may be worth finding a way to save the data you generate on hard disk so that you don't need to run the randomization again and again. If you decide to do so, please write your code to save any such files with processed data in the directory where this file resides. This way, we can run your code without having to alter it.

If you need to save any new data, think carefully about the most efficient way, both in terms of time and space, to save them.
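
One possible way to avoid rerunning a slow randomization is a small load-or-compute helper, sketched below with the standard `pickle` module. The name `load_or_compute` and the file name in the usage note are hypothetical; a plain text or `numpy` format may be more compact, depending on what you store.

```python
import os
import pickle


def load_or_compute(path, compute):
    """Load a cached result from `path` if it exists; otherwise compute and save it."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    result = compute()
    with open(path, 'wb') as f:
        # The highest protocol gives the most compact binary representation
        pickle.dump(result, f, protocol=pickle.HIGHEST_PROTOCOL)
    return result
```

A call such as `load_or_compute('randomized_counts.pkl', run_simulation)`, where `run_simulation` stands for your own function, would store the file in the directory where this notebook resides.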

## Import and run your code here

Keep your code in separate `.py` files and then import it in the code cell below. In the subsequent cells, call the functions/methods you need to conduct the requested analyses. We should be able to run all cells here to calculate again the results and get the requested output, without having to modify your code in any way. 

In [1]:
# Import modules here

from reading_files import *

from cheaters_teaming_up import *
from cheaters_interactions import *

from team_randomization import *
from kills_randomization import *

from simulations_cheating import *

### 1. Do cheaters team up?

Use the files `cheaters.txt` and `team_ids.txt` to estimate how often cheaters (regardless of when exactly they started cheating) end up on the same team. Your output should say how many teams have 0, 1, 2, 3, or 4 cheaters.
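
The counting step could look roughly like the sketch below, which uses only dictionaries and sets. The function name and the row layout `(match_id, player_id, team_id)` are assumptions based on the column order listed in the Data section.

```python
def count_teams_by_cheaters(team_rows, cheater_ids):
    """Return {number of cheaters on a team: number of such teams}.

    team_rows: iterable of (match_id, player_id, team_id) tuples
    cheater_ids: set of player account ids found in cheaters.txt
    """
    cheaters_per_team = {}
    for match_id, player_id, team_id in team_rows:
        # A team is identified by its match and its team id within that match
        key = (match_id, team_id)
        cheaters_per_team[key] = cheaters_per_team.get(key, 0) + (player_id in cheater_ids)

    counts = {}
    for n_cheaters in cheaters_per_team.values():
        counts[n_cheaters] = counts.get(n_cheaters, 0) + 1
    return counts
```

Storing the cheater ids in a set keeps each membership check at constant time, which matters when the team file has many rows.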

Now, randomly shuffle the team ids among the players in a match. Repeat this 20 times and estimate the expected counts as before. Output the mean and the 95% confidence intervals for the expected counts. 
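
A minimal sketch of the shuffling step, under the same assumed row layout `(match_id, player_id, team_id)`: team ids are permuted among the players of each match, so team sizes and the set of players per match stay the same. The function name and the optional `rng` argument are hypothetical.

```python
import random


def shuffle_teams_within_matches(team_rows, rng=random):
    """Return new rows in which team ids are shuffled among players of the same match."""
    by_match = {}
    for match_id, player_id, team_id in team_rows:
        players, team_ids = by_match.setdefault(match_id, ([], []))
        players.append(player_id)
        team_ids.append(team_id)

    shuffled = []
    for match_id, (players, team_ids) in by_match.items():
        permuted = team_ids[:]  # copy, so the grouped input is left untouched
        rng.shuffle(permuted)
        shuffled.extend((match_id, p, t) for p, t in zip(players, permuted))
    return shuffled
```

Passing a seeded generator such as `random.Random(0)` as `rng` makes a run reproducible while you debug.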

*Optional: Conclude in a short comment what you observe. This reflection is optional and will not be marked.*

In [2]:
# Output answers here

cheaters = get_cheaters()
teams = get_teams()

zero_cheaters, one_cheater, two_cheaters, three_cheaters, four_cheaters = get_cheater_counters(cheaters, teams)

print('The number of teams with zero cheaters is: ', zero_cheaters)
print('The number of teams with one cheater is: ', one_cheater)
print('The number of teams with two cheaters is: ', two_cheaters)
print('The number of teams with three cheaters is: ', three_cheaters)
print('The number of teams with four cheaters is: ', four_cheaters)
print()

(ci_zero_cheaters, mean_zero_cheaters, ci_one_cheater, mean_one_cheater,
 ci_two_cheaters, mean_two_cheaters, ci_three_cheaters, mean_three_cheaters,
 ci_four_cheaters, mean_four_cheaters) = cheaters_teaming_up_simulation(20)

print('The mean and confidence interval for number of teams with zero cheaters is: ', \
      mean_zero_cheaters, '|', ci_zero_cheaters)
print('The mean and confidence interval for number of teams with one cheater is: ', \
      mean_one_cheater, '|', ci_one_cheater)
print('The mean and confidence interval for number of teams with two cheaters is: ', \
      mean_two_cheaters, '|', ci_two_cheaters)
print('The mean and confidence interval for number of teams with three cheaters is: ', \
      mean_three_cheaters, '|', ci_three_cheaters)
print('The mean and confidence interval for number of teams with four cheaters is: ', \
      mean_four_cheaters, '|', ci_four_cheaters)
print()

print('By comparing the actual results observed in data with the values obtained through our simulation, ' \
      'we understand that the actual results clearly fall outside the estimated confidence intervals.')
print()
print('Since these confidence intervals correspond to the values we would expect to observe in case team ' \
      'distribution was random among players, the data indicates some non-random effect is at play.')
print()
print('Therefore, the data appears to confirm that cheating players do team up. The number of teams with ' \
      'only one cheater (3199) is well below the lower bound of the confidence interval (3523.6). ' \
      'Simultaneously, the number of teams with more than one cheater (in which cheaters "teamed up") is ' \
      'above our expectations from the randomized simulations, for all the cases of 2, 3, or 4 cheaters.')

The number of teams with zero cheaters is:  170782
The number of teams with one cheater is:  3199
The number of teams with two cheaters is:  182
The number of teams with three cheaters is:  9
The number of teams with four cheaters is:  2

The mean and confidence interval for number of teams with zero cheaters is:  170610.7 | [170608.1 : 170613.3]
The mean and confidence interval for number of teams with one cheater is:  3528.85 | [3523.6 : 3534.1]
The mean and confidence interval for number of teams with two cheaters is:  34.2 | [31.5 : 36.9]
The mean and confidence interval for number of teams with three cheaters is:  0.25 | [0.1 : 0.4]
The mean and confidence interval for number of teams with four cheaters is:  0.0 | [0.0 : 0.0]

By comparing the actual results observed in data with the values obtained through our simulation, we understand that the actual results clearly fall outside the estimated confidence intervals.

Since these confidence intervals correspond to the values we wo

### 2. Do victims of cheating start cheating?

Use the files `cheaters.txt` and `kills.txt` to count how many players got killed by an active cheater on at least one occasion and then started cheating. Specifically, we are interested in situations where:

1. Player B has started cheating but player A is not cheating.
2. Player B kills player A.
3. At some point afterwards, player A starts cheating.

Output the count in the data. 
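
The three conditions above can be sketched as a single pass over the kills. The function name is hypothetical, and the inputs are assumptions: `match_dates` maps each match id to its starting date, and `cheating_start` maps each cheater's id to the estimated date they started cheating. The dates can be any mutually comparable values, such as `datetime.date` objects.

```python
def count_victims_who_start_cheating(kills, match_dates, cheating_start):
    """Count distinct players killed by an active cheater who later start cheating.

    kills: iterable of (match_id, killer_id, victim_id, time) tuples
    """
    victims = set()
    for match_id, killer, victim, _time in kills:
        killer_start = cheating_start.get(killer)
        victim_start = cheating_start.get(victim)
        if killer_start is None or victim_start is None:
            continue  # both players must appear in cheaters.txt
        day = match_dates[match_id]
        # Killer already cheats on the match date; victim starts only afterwards
        if killer_start <= day < victim_start:
            victims.add(victim)
    return len(victims)
```

Collecting victims in a set ensures that a player killed by cheaters on several occasions is counted once.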

Then, simulate alternative worlds in which everything is the same but the events unfolded in a somewhat different sequence. To do so, randomize within a game, keeping the timing and structure of interactions but shuffling the player ids. Generate 20 randomizations like this and estimate the expected count of victims of cheating who start cheating as before. Output the mean and the 95% confidence interval for the expected count in these randomized worlds.
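
One possible reading of this randomization is sketched below: within each match, the player ids are permuted and the resulting one-to-one mapping is applied to both the killer and the victim columns, so the timing and the kill structure are preserved. The function name and the `rng` argument are hypothetical; other schemes (for example, shuffling only among non-cheaters) may also be defensible.

```python
import random


def shuffle_players_within_matches(kills, rng=random):
    """Return kills with player ids consistently permuted within each match.

    kills: list of (match_id, killer_id, victim_id, time) tuples
    """
    players_by_match = {}
    for match_id, killer, victim, _time in kills:
        players_by_match.setdefault(match_id, set()).update((killer, victim))

    mapping = {}
    for match_id, players in players_by_match.items():
        original = sorted(players)  # fixed order, so a seeded rng is reproducible
        permuted = original[:]
        rng.shuffle(permuted)
        mapping[match_id] = dict(zip(original, permuted))

    return [(m, mapping[m][k], mapping[m][v], t) for m, k, v, t in kills]
```

Because the mapping is a bijection within each match, a player who made three kills in the original data is replaced by a single player who makes the same three kills at the same times.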

*Optional: Conclude in a short comment what you observe. This reflection is optional and will not be marked.*

#### Hint

The starting time of cheating is estimated as a date, so assume that a player cheats in any match that they started playing on that date or afterwards. Use the match starting date so that if the match started before midnight of the cheating date but ended after midnight, we will assume that the player was not cheating just yet.
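
The hint can be sketched as follows. The data do not include an explicit match start time, so this sketch approximates it by the earliest kill recorded in the match, which is an assumption. It also assumes the kill times have already been parsed into `datetime` objects; the function names are hypothetical.

```python
from datetime import date, datetime


def match_start_dates(kills):
    """Map each match id to the date of its earliest recorded kill (a proxy for the start)."""
    earliest = {}
    for match_id, _killer, _victim, when in kills:
        if match_id not in earliest or when < earliest[match_id]:
            earliest[match_id] = when
    return {match_id: when.date() for match_id, when in earliest.items()}


def is_cheating(player, match_date, cheating_start):
    """True if the player's estimated cheating start date is on or before the match date."""
    start = cheating_start.get(player)
    return start is not None and start <= match_date


# Toy usage: a match that starts just before midnight counts for March 1,
# so a player who starts cheating on March 2 is not yet cheating in it.
toy_kills = [('m1', 'a', 'b', datetime(2019, 3, 2, 0, 5)),
             ('m1', 'c', 'd', datetime(2019, 3, 1, 23, 58))]
toy_dates = match_start_dates(toy_kills)
print(toy_dates['m1'])                                              # 2019-03-01
print(is_cheating('a', toy_dates['m1'], {'a': date(2019, 3, 2)}))   # False
```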


In [3]:
# Output answers here

cheaters = get_cheaters()
kills = get_kills()

matches_start = match_starting_time(kills)
nr_victim_cheaters = counter_victim_cheaters(kills, matches_start, cheaters)

print("The total number of cases where a player started cheating after being killed by a cheating player is: ", \
      nr_victim_cheaters)
print()

ci_victim_cheaters, mean_victim_cheaters = victim_cheaters_simulation(20)

print('The mean and confidence interval for "victim cheaters" is: ', \
      mean_victim_cheaters, '|', ci_victim_cheaters)
print()

print('By comparing the actual results observed in data with the values obtained through our simulation, ' \
      'we understand that the actual result clearly falls outside the estimated confidence interval.')
print()
print('Since the confidence interval corresponds to the values we would expect to observe if cheating behaviour ' \
      'were random among players (and independent of victimhood), the data indicates some non-random effect is at play.')
print()
print('Therefore, the data appears to confirm a pattern in which cheating players were killed by ' \
      'cheating players before starting cheating. It suggests that being killed by a cheating player ' \
      'can lead a player to start cheating.')

The total number of cases where a player started cheating after being killed by a cheating player is:  47

The mean and confidence interval for "victim cheaters" is:  14.35 | [12.3 : 16.4]

By comparing the actual results observed in data with the values obtained through our simulation, we understand that the actual result clearly falls outside the estimated confidence interval.

Since the confidence interval corresponds to the values we would expect to observe if cheating behaviour were random among players (and independent of victimhood), the data indicates some non-random effect is at play.

Therefore, the data appears to confirm a pattern in which cheating players were killed by cheating players before starting cheating. It suggests that being killed by a cheating player can lead a player to start cheating.


### 3. Do observers of cheating start cheating?

Use the files `cheaters.txt` and `kills.txt` to count how many players observed an active cheater on at least one occasion and then started cheating. Cheating players can be recognized because they exhibit abnormal killing patterns. We will assume that player A realizes that player B cheats if:

1. Player B has started cheating but player A is not cheating.
2. Player B kills at least 3 other players before player A gets killed in the game.
3. At some point afterwards, player A starts cheating.

Output the count in the data.
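
A rough sketch of the observer count, under stated assumptions: kills are processed in time order within each match, only kills made strictly before player A's death count towards the threshold of 3, and `match_dates` and `cheating_start` are dictionaries of comparable dates keyed by match id and player id. The function name is hypothetical, and whether the kill of A itself should count towards the threshold is a matter of interpretation.

```python
def count_observers_who_start_cheating(kills, match_dates, cheating_start, min_kills=3):
    """Count distinct players who die after an active cheater has made
    at least `min_kills` kills in the same match, and who later start cheating.

    kills: iterable of (match_id, killer_id, victim_id, time) tuples
    """
    by_match = {}
    for match_id, killer, victim, when in kills:
        by_match.setdefault(match_id, []).append((when, killer, victim))

    observers = set()
    for match_id, events in by_match.items():
        day = match_dates[match_id]
        cheater_kills = {}
        visible = False  # True once some active cheater has reached min_kills
        for _when, killer, victim in sorted(events):
            victim_start = cheating_start.get(victim)
            # The victim is checked first, so only strictly earlier kills count
            if visible and victim_start is not None and victim_start > day:
                observers.add(victim)
            killer_start = cheating_start.get(killer)
            if killer_start is not None and killer_start <= day:
                cheater_kills[killer] = cheater_kills.get(killer, 0) + 1
                if cheater_kills[killer] >= min_kills:
                    visible = True
    return len(observers)
```

Since the function takes the kills as an argument, it can be applied unchanged to each of the randomized kill lists from Part 2.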

Then, use the 20 randomizations from Part 2 to estimate the expected count of observers of cheating who start cheating. Output the mean and the 95% confidence interval for the expected count in these randomized worlds.

*Optional: Conclude in a short comment what you observe. This reflection is optional and will not be marked.*

In [4]:
# Output answers here

cheaters = get_cheaters()
kills = get_kills()

nr_observer_cheaters = get_observer_cheaters(cheaters, kills)

print("The total number of cases where a player started cheating after observing a cheating player is: ", \
      nr_observer_cheaters)
print()


ci_observer_cheaters, mean_observer_cheaters = observer_cheaters_simulation(20)

print('The mean and confidence interval for "observer cheaters" is: ', \
      mean_observer_cheaters, '|', ci_observer_cheaters)
print()

print('By comparing the actual results observed in data with the values obtained through our simulation, ' \
      'we understand that the actual result clearly falls outside the estimated confidence interval.')
print()
print('Since this confidence interval corresponds to the values we would expect to observe if cheating behaviour ' \
      'were random among players (and independent of observing a cheater), the data indicates some non-random effect is at play.')
print()
print('Therefore, the data appears to confirm a pattern in which cheating players observed ' \
      'cheating players before starting cheating. It suggests that observing a cheating player ' \
      'can lead a player to start cheating.')

The total number of cases where a player started cheating after observing a cheating player is:  260

The mean and confidence interval for "observer cheaters" is:  155.3 | [150.0 : 160.6]

By comparing the actual results observed in data with the values obtained through our simulation, we understand that the actual result clearly falls outside the estimated confidence interval.

Since this confidence interval corresponds to the values we would expect to observe if cheating behaviour were random among players (and independent of observing a cheater), the data indicates some non-random effect is at play.

Therefore, the data appears to confirm a pattern in which cheating players observed cheating players before starting cheating. It suggests that observing a cheating player can lead a player to start cheating.


---

### Evaluation

| Aspect         | Mark     | Comment   
|:--------------:|:--------:|:----------------------
| Code runs      |   /20    |              
| Output 1       |   /10    | 
| Output 2       |   /10    | 
| Output 3       |   /10    | 
| Legibility     |   /10    | 
| Modularity     |   /10    | 
| Optimization   |   /30    | 
| **Total**      |**/100**  | 
