# P2: Project proposal and initial analyses
When you are done with Homework H1, you will continue to work on the next project milestone. In Milestone P2, together with your team members, you will agree on and refine your project proposal. Your first task is to select a project. Even though we provide the datasets for you to use, at this juncture, it is your responsibility to perform initial analyses and verify that what you propose is feasible given the data (including any additional data you might bring in yourself), which is crucial for the success of the project.

The goal of this milestone is to intimately acquaint yourself with the data, preprocess it, and complete all the necessary descriptive statistics tasks. We expect you to have a pipeline in place, fully documented in a notebook, and show us that you have clear project goals.

When describing the relevant aspects of the data, and any other datasets you may intend to use, you should in particular show (non-exhaustive list):

- That you can handle the data at its full size.
- That you understand what’s in the data (formats, distributions, missing values, correlations, etc.).
- That you considered ways to enrich, filter, and transform the data according to your needs.
- That you have a reasonable plan and ideas for methods you’re going to use, giving their essential mathematical details in the notebook.
- That your plan for analysis and communication is reasonable and sound, potentially discussing alternatives to your choices that you considered but dropped.
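
The checks in the list above can be sketched with pandas. This is only a minimal sketch on a toy frame: the column names mirror the Wikispeedia paths files loaded later in this notebook, the values are invented for illustration, and `path_length` is a hypothetical derived feature.

```python
import pandas as pd

# Toy stand-in for a real dataset (values are invented for illustration).
df = pd.DataFrame({
    "durationInSec": [166, 88, 138, 37, 175],
    "rating": [None, 3.0, None, 2.0, 3.0],
    "path": ["A;B;C", "A;C", "A;B;D;C", "A;C", "A;E;C"],
})

# Size: memory footprint in bytes, including string contents.
memory_bytes = int(df.memory_usage(deep=True).sum())

# Missing values: fraction of NaN per column.
missing_share = df.isna().mean()

# Distributions: summary statistics of the numeric columns.
summary = df.describe()

# Correlations: derive a numeric feature first, then correlate.
df["path_length"] = df["path"].str.count(";") + 1
corr = df[["durationInSec", "path_length"]].corr()

print(memory_bytes, missing_share, summary, corr, sep="\n\n")
```

On the real data, the same calls would run on the loaded frames instead of the toy `df`.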

We will evaluate this milestone according to how well these steps have been done and documented, the quality of the code and its documentation, the feasibility and critical awareness of the project. We will also evaluate this milestone according to how clear, reasonable, and well thought through the project idea is. Please use the second milestone to really check with us that everything is in order with your project (idea, feasibility, etc.) before you advance too much with the final Milestone P3! There will be project office hours dedicated to helping you.

You will work in a public GitHub repository dedicated to your project, which can be created by following this link. The repository will automatically be named ada-2023-project-<your_team_name>. By the Milestone P2 deadline, each team should have a single public GitHub repo under the epfl-ada GitHub organization, containing the project proposal and initial analysis code.

## P2 deliverable (done as a team): GitHub repository with the following:

- README.md file containing the detailed project proposal (up to 1000 words). Your README.md should contain:
  - Title
  - Abstract: A 150-word description of the project idea and goals. What’s the motivation behind your project? What story would you like to tell, and why?
  - Research Questions: A list of research questions you would like to address during the project.
  - Proposed additional datasets (if any): List the additional dataset(s) you want to use (if any), and some ideas on how you expect to get, manage, process, and enrich it/them. Show us that you’ve read the docs and some examples, and that you have a clear idea on what to expect. Discuss data size and format if relevant. It is your responsibility to check that what you propose is feasible.
  - Methods
  - Proposed timeline
  - Organization within the team: A list of internal milestones up until project Milestone P3.
  - Questions for TAs (optional): Add here any questions you have for us related to the proposed project.
- Notebook containing initial analyses and data handling pipelines. We will grade the correctness, quality of code, and quality of textual descriptions.


In [1]:
import pandas as pd
import matplotlib.pyplot as plt

In [8]:
# Import the data with the correct format
data_folder = '../Data'
paths_folder = data_folder + '/wikispeedia_paths-and-graph'
columns_f = ['hashedIpAddress','timestamp','durationInSec','path','rating']
columns_u = ['hashedIpAddress','timestamp','durationInSec','path','target', 'type']

finished_paths = paths_folder + '/paths_finished.tsv'
unfinished_paths = paths_folder + '/paths_unfinished.tsv'
df_success = pd.read_csv(finished_paths,comment='#',sep='\t',names=columns_f)
df_fail = pd.read_csv(unfinished_paths,comment='#',sep='\t',names=columns_u)
display(df_fail)

Unnamed: 0,hashedIpAddress,timestamp,durationInSec,path,target,type
0,2426091a53125110,1297054935,1804,Obi-Wan_Kenobi,Microsoft,timeout
1,26141fd878806294,1297055651,1805,Julius_Caesar,Caracas,timeout
2,2b015fb8181c48f2,1297090819,1818,Malawi;Democracy;Alexander_the_Great,First_Crusade,timeout
3,53a53bc244e08a6a,1297094761,49,Paraguay,Mount_St._Helens,restart
4,53a53bc244e08a6a,1297099105,1808,Paraguay;Bolivia,Mount_St._Helens,timeout


The failed paths dataset is polluted by games that were abandoned from the start. One could argue that the player gave up immediately because the game looked too difficult, but it is just as plausible that the player simply didn't *want* to play. It is simpler to filter out all games that timed out before the first click.

Games that ended in a restart are kept. They are still valuable: a restart may mean that the player was simply stuck, and it tells us that the player *still wants to play*.
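
Before applying the filter, it may help to check how many rows it would remove. The sketch below does this on a toy frame whose rows mimic the preview above; the `no_click` and `breakdown` names are introduced here for illustration and are not part of the original pipeline.

```python
import pandas as pd

# Toy unfinished-paths frame (rows modeled on the preview, for illustration).
df_fail = pd.DataFrame({
    "path": ["Obi-Wan_Kenobi", "Malawi;Democracy", "Paraguay", "Paraguay;Bolivia"],
    "type": ["timeout", "timeout", "restart", "timeout"],
})

# A path without ';' contains only the start article: no click was made.
no_click = ~df_fail["path"].str.contains(";", regex=False)

# Cross-tabulate the end type against "no click" to see what the filter removes.
breakdown = pd.crosstab(df_fail["type"], no_click.rename("no_click"))
dropped = int((no_click & (df_fail["type"] == "timeout")).sum())

print(breakdown)
print(f"rows dropped by the filter: {dropped} of {len(df_fail)}")
```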

In [3]:
# Drop games that timed out before any click was made.
# A path without ';' contains only the start article, i.e. the player stayed at the start.
df_fail = df_fail[~((df_fail['type'] == 'timeout') &
                    (~df_fail['path'].str.contains(';', regex=False)))].copy()
display(df_fail)

In [4]:
# Convert Unix timestamps (seconds) to human-readable datetimes (vectorized, no apply needed)
#df_success['timestamp'] = pd.to_datetime(df_success['timestamp'], unit='s')
#df_fail['timestamp']    = pd.to_datetime(df_fail['timestamp'], unit='s')


#TODO do this cell once, and figure out how to write it to a tsv in a clean format
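
One possible way to address the TODO above is sketched here on a toy frame. An in-memory buffer stands in for the output file, whose name and location are not decided in this notebook. Forcing `hashedIpAddress` to `str` on read is a precaution added here, since a hash made only of digits could otherwise be parsed as a number.

```python
import io
import pandas as pd

# Toy frame with the same timestamp format as the paths files.
df = pd.DataFrame({
    "hashedIpAddress": ["2426091a53125110", "26141fd878806294"],
    "timestamp": [1297054935, 1297055651],
    "durationInSec": [1804, 1805],
})

# Vectorized conversion from Unix seconds to datetimes.
df["timestamp"] = pd.to_datetime(df["timestamp"], unit="s")

# Write as TSV without the index column, so no "Unnamed: 0" column appears on reload.
buffer = io.StringIO()
df.to_csv(buffer, sep="\t", index=False)

# Read back, parsing the timestamp column as datetimes again.
buffer.seek(0)
restored = pd.read_csv(
    buffer,
    sep="\t",
    parse_dates=["timestamp"],
    dtype={"hashedIpAddress": str},
)
print(restored.dtypes)
```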

In [5]:
df_success

In [12]:
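# Split each finished path into its list of visited articles
paths_success = df_success['path'].str.split(';')
# For failed games, append the intended target so that the last element is the goal.
# Note: this overwrites df_fail.path, so run this cell only once per data load.
df_fail['path'] = df_fail['path'].astype(str) + ';' + df_fail['target']
paths_failed = df_fail['path'].str.split(';')
# Summarize each game as "start -> target"
paths_filtered_success = paths_success.apply(lambda x: x[0] + ' -> ' + x[-1])
paths_filtered_fail = paths_failed.apply(lambda x: x[0] + ' -> ' + x[-1])
paths_filtered_total = pd.concat([paths_filtered_success, paths_filtered_fail])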

array(['14th_century -> African_slave_trade', '14th_century -> Greece',
       '14th_century -> John_F._Kennedy', ...,
       'Yagan -> U.S._Open_%28tennis%29',
       'Yarralumla%2C_Australian_Capital_Territory -> Abraham_Lincoln',
       'Ziad_Jarrah -> Jew'], dtype=object)

In [7]:
game_reps_total=paths_filtered_total.value_counts()
display(game_reps_total)
game_reps_success=paths_filtered_success.value_counts()
game_reps_fail = paths_filtered_fail.value_counts()

path
Brain -> Telephone                          2044
Theatre -> Zebra                            1777
Asteroid -> Viking                          1770
Pyramid -> Bean                             1586
Batman -> Wood                               223
                                            ... 
Arctic_Monkeys -> Mediterranean_Sea            1
Archaeoastronomy -> Pig                        1
Archaeoastronomy -> Laos                       1
Archaeoastronomy -> Battle_of_Leyte_Gulf       1
Jamaica -> Alan_Turing                         1
Name: count, Length: 42631, dtype: int64
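
One possible use of these counts is a per-pair success rate. The sketch below shows the idea on toy labels. The `stats` frame and its column names are hypothetical, and the counts are invented for illustration rather than taken from the data.

```python
import pandas as pd

# Toy "start -> target" labels (counts invented for illustration).
pairs_success = pd.Series(["Brain -> Telephone"] * 3 + ["Batman -> Wood"])
pairs_fail = pd.Series(["Brain -> Telephone", "Theatre -> Zebra"])

counts_success = pairs_success.value_counts()
counts_total = pd.concat([pairs_success, pairs_fail]).value_counts()

# Align on the pair label; pairs that were never finished get a count of 0.
stats = pd.DataFrame({
    "played": counts_total,
    "finished": counts_success.reindex(counts_total.index, fill_value=0),
})
stats["success_rate"] = stats["finished"] / stats["played"]

print(stats.sort_values("played", ascending=False))
```

Pairs played only once give rates of exactly 0 or 1, so a minimum number of plays would probably be needed before comparing rates.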