## Extracting the list of tournaments

I will build a list of tournaments and add the dates on which each tournament took place.  I would also like to add tournament tier information, if possible.

In [13]:
import pandas as pd

In [14]:
agents_pick_rates = pd.read_csv("../data/vct_2021/agents/agents_pick_rates.csv")

In [15]:
draft_phase = pd.read_csv("../data/vct_2021/matches/draft_phase.csv")
kills_stats = pd.read_csv("../data/vct_2021/matches/kills_stats.csv")
win_loss_methods_count = pd.read_csv("../data/vct_2021/matches/win_loss_methods_count.csv")


In [16]:
tournament_list_2021_v1 = agents_pick_rates.Tournament.unique()
tournament_list_2021_v2 = draft_phase.Tournament.unique()
tournament_list_2021_v2_2 = kills_stats.Tournament.unique()
tournament_list_2021_v2_3 = win_loss_methods_count.Tournament.unique()

print(len(tournament_list_2021_v1), len(tournament_list_2021_v2), len(tournament_list_2021_v2_2), len(tournament_list_2021_v2_3))

for tournament in tournament_list_2021_v1:
    if tournament not in tournament_list_2021_v2_3:
        print(tournament)

142 79 101 142


In [17]:
in_v2_but_not_in_v2_2 = []
for tournament in tournament_list_2021_v2:
    if tournament not in tournament_list_2021_v2_2:
        in_v2_but_not_in_v2_2.append(tournament)

print(in_v2_but_not_in_v2_2)
print(len(in_v2_but_not_in_v2_2))

['Champions Tour North America Stage 1: Masters']
1


We saw above that the number of unique tournaments is not consistent across the dataframes.  For now, my plan is:

- First: make a list of tournaments from each dataframe loaded from the csv files in the `matches` folder.

- Second: take the intersection of all the lists and see how many tournaments remain.
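The two steps above can be sketched with Python sets. This is only an illustration: the file names and tournament names below are made-up placeholders, not values from the dataset.

```python
# Hypothetical toy data: file name -> unique tournament names (placeholders only).
toy_tournaments = {
    "file_a": ["T1", "T2", "T3"],
    "file_b": ["T1", "T2"],
    "file_c": ["T2", "T1", "T4"],
}

# Step 1: one set of tournaments per file.
tournament_sets = {name: set(values) for name, values in toy_tournaments.items()}

# Step 2: intersect all the sets to keep only tournaments present in every file.
common = set.intersection(*tournament_sets.values())

print(sorted(common))
print(len(common))
```

With the real data, the values would presumably be the arrays returned by `.Tournament.unique()` for each csv file.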

In [18]:
# Build a dictionary whose keys are file names and whose values are 1-dimensional ndarrays of unique tournament names.
# We skip the team mapping file here because it has no "Tournament" column.
data_files = ["draft_phase", "eco_rounds", "eco_stats", "kills_stats",
              "kills", "maps_played", "maps_scores", "overview", "rounds_kills",
              "scores", "win_loss_methods_count", "win_loss_methods_round_number"]
tournaments = {}
for file in data_files:
    filename = f"../data/vct_2021/matches/{file}.csv"
    tournaments[file] = pd.read_csv(filename).Tournament.unique()



In [19]:
print("The number of unique tournaments of each file.")
for file, df in tournaments.items():
    print(f"{file}: {len(df)}")

The number of unique tournaments of each file.
draft_phase: 79
eco_rounds: 101
eco_stats: 101
kills_stats: 101
kills: 101
maps_played: 142
maps_scores: 142
overview: 142
rounds_kills: 101
scores: 142
win_loss_methods_count: 142
win_loss_methods_round_number: 142


I would like to group these files by their number of unique tournaments.

In [20]:
files_by_tnmt_len = {}

for file, df in tournaments.items():
    if len(df) in files_by_tnmt_len:
        files_by_tnmt_len[len(df)].append(file)
    else:
        files_by_tnmt_len[len(df)] = [file]

files_by_tnmt_len

{79: ['draft_phase'],
 101: ['eco_rounds', 'eco_stats', 'kills_stats', 'kills', 'rounds_kills'],
 142: ['maps_played',
  'maps_scores',
  'overview',
  'scores',
  'win_loss_methods_count',
  'win_loss_methods_round_number']}

My goal in this note is to find the list of tournaments that every file has, or to figure out which tournaments are missing from each file.

- `Claim`: In this dataset, if "file1" and "file2" have the same number of unique tournaments, then they have the same set of unique tournaments.

Let me first define the functions that will be used to check this claim.

In [21]:
def not_in_the_other(a, b) -> list:
    '''
    input: a pair of sequences (lists or 1-dimensional ndarrays)
    return: the list of elements which are in a, but not in b.
    '''
    not_in_b = []
    for i in a:
        if i not in b:
            not_in_b.append(i)

    return not_in_b

def all_pairs(k: int) -> list:
    '''
    input: a positive integer k bigger than 1
    return: list of all pairs (n, m) with 0 <= n < m <= k-1.
    '''
    if k < 2:
        # Raise instead of returning the result of print(), which is always None.
        raise ValueError("input needs to be an integer bigger than 1")

    pairs = []
    for n in range(k):
        for m in range(n + 1, k):
            pairs.append((n, m))
    return pairs
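As a side note, the standard library can generate the same index pairs. The sketch below uses `itertools.combinations`; the function name `all_pairs_combinations` is a hypothetical one chosen for this illustration and is not used elsewhere in the notebook.

```python
from itertools import combinations

def all_pairs_combinations(k: int) -> list:
    """Return all index pairs (n, m) with 0 <= n < m <= k-1, in lexicographic order."""
    if k < 2:
        raise ValueError("input needs to be an integer bigger than 1")
    # combinations(range(k), 2) yields each unordered pair exactly once.
    return list(combinations(range(k), 2))

print(all_pairs_combinations(4))
```

For `k = 4` this gives the six pairs from `(0, 1)` through `(2, 3)`, in the same order as the hand-written loop.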

In [22]:
for length, file_list in files_by_tnmt_len.items():
    print(f"Check if all the files, having {length} many tournaments, share the same tournaments.")
    if len(file_list) == 1:
        print(f"There is only one file with {length} many tournaments.")
    else:
        pair_ind = all_pairs(len(file_list))
        diff_pair_ind = []

        for i, j in pair_ind:
            not_in_intersection = not_in_the_other(tournaments[file_list[i]], tournaments[file_list[j]])
            if not_in_intersection != []:
                print(not_in_intersection)
                diff_pair_ind.append((i, j))

        if len(diff_pair_ind) > 0:
            for i, j in diff_pair_ind:
                print(f"{file_list[i]} and {file_list[j]} have different tournaments.")
        else:
            print("All files have the same tournaments as we want.")
    print("====================")

Check if all the files, having 79 many tournaments, share the same tournaments.
There is only one file with 79 many tournaments.
Check if all the files, having 101 many tournaments, share the same tournaments.
All files have the same tournaments as we want.
Check if all the files, having 142 many tournaments, share the same tournaments.
All files have the same tournaments as we want.


The observations above reduce to three cases: files with 79, 101, or 142 tournaments.

Below I check whether the 79 tournaments are a subset of the 101 tournaments, and whether the 101 tournaments are a subset of the 142 tournaments.

To do this, we can simply use the function `not_in_the_other`.

Remark: "draft_phase" has 79, "eco_rounds" has 101, and "maps_played" has 142 tournaments.
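The same kind of subset check can also be written with Python sets. This is a minimal sketch on made-up placeholder names, not the real tournament lists: `<=` tests whether one set is contained in another, and `-` lists the elements that break the containment.

```python
# Hypothetical toy sets standing in for the small, medium, and large tournament lists.
small = {"A", "B", "X"}
medium = {"A", "B", "C"}
large = {"A", "B", "C", "D"}

# Is every element of `small` also in `medium`?
print(small <= medium)
# Which elements of `small` are missing from `medium`?
print(sorted(small - medium))

# Is every element of `medium` also in `large`?
print(medium <= large)
print(sorted(medium - large))
```

Here `small` is not a subset of `medium` because of the single extra element `"X"`, while `medium` is a subset of `large`.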

In [23]:
### 79 tournaments are in 101 tournaments?
missing_tnmt = not_in_the_other(tournaments["draft_phase"], tournaments["eco_rounds"])
if missing_tnmt:
    print(missing_tnmt, "is not in 101 many tournaments.")
else:
    print("We are good!")

['Champions Tour North America Stage 1: Masters'] is not in 101 many tournaments.


In [24]:
### 101 tournaments are in 142 tournaments?
missing_tnmt = not_in_the_other(tournaments["eco_rounds"], tournaments["maps_played"])
if missing_tnmt:
    print(missing_tnmt, "is not in 142 many tournaments.")
else:
    print("We are good!")

We are good!


### Conclusion:
'Champions Tour North America Stage 1: Masters' is not among the 101 tournaments.  Thus, we have "full" data for only 78 tournaments.\
I checked the source website and saw that there were 142 tournaments in total, so whoever made this Kaggle dataset either forgot to scrape some data or couldn't find it on the source website.\
It's possible that some features, such as "draft_phase", "eco_rounds", and "kills_stats", were not provided for some tournaments.\
However, 'Champions Tour North America Stage 1: Masters' was one of the bigger tournaments, so we should be able to find the missing data.

#### We have two options:
- work with 79 tournaments (after finding the missing data for 'Champions Tour North America Stage 1: Masters'), or

- find the data for the (142-79 = 63) missing tournaments on the source website or somewhere else.
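For either option, it helps to know exactly which tournaments each file lacks. A minimal sketch, assuming a dictionary that maps file names to sets of tournament names; the names below are placeholders, not real data:

```python
# Hypothetical toy data: file name -> set of tournament names (placeholders only).
toy = {
    "draft": {"A", "B"},
    "kills": {"A", "C"},
    "maps": {"A", "B", "C", "D"},
}

# Union of every tournament seen in any file.
all_tournaments = set().union(*toy.values())

# For each file, the tournaments that appear elsewhere but not in that file.
missing_by_file = {name: sorted(all_tournaments - s) for name, s in toy.items()}

for name, missing in missing_by_file.items():
    print(f"{name}: missing {len(missing)} -> {missing}")
```

Applied to the real data, the sets would presumably be built from each file's unique `Tournament` values.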