# Data Analysis
Table of Contents:
* [Importing Data](#one)
* [Data Manipulation](#two)

In [3]:
import pandas as pd



## 1. Importing Data <a class="anchor" id="one"></a>
For each scenario, create a DataFrame that records which CSV files belong to which run, test user, and video type. <br>
Before running the code blocks below, change the `prefix` variable to the path where this project is stored on your computer.

In [6]:
prefix = "/Users/carolinejung/tiktok-like-experiment/" # CHANGE ME!

In [7]:
sc2_files = pd.DataFrame(
    [{"run": 1, "user": "control", "videos": "all", "path": prefix+"data/-1_Sec02Gr2Sc2Cntrl_CJ_02-12-19-31_like_by_hashtag_data_all_videos.csv"},
     {"run": 1, "user": "control", "videos": "liked", "path": prefix+"data/-1_Sec02Gr2Sc2Cntrl_CJ_02-12-19-31_like_by_hashtag_data_liked_videos.csv"},
     {"run": 1, "user": "active", "videos": "all", "path": prefix+"data/2_Sec02Gr2Sc2Activ_CJ_02-12-19-30_like_by_control_data_all_videos.csv"},
     {"run": 1, "user": "active", "videos": "liked", "path": prefix+"data/2_Sec02Gr2Sc2Activ_CJ_02-12-19-30_like_by_control_data_liked_videos.csv"},

     {"run": 2, "user": "control", "videos": "all", "path": prefix+"data/-1_Sec02Gr2Sc2Cntrl_CJ_02-12-19-40_like_by_hashtag_data_all_videos.csv"},
     {"run": 2, "user": "control", "videos": "liked", "path": prefix+"data/-1_Sec02Gr2Sc2Cntrl_CJ_02-12-19-40_like_by_hashtag_data_liked_videos.csv"},
     {"run": 2, "user": "active", "videos": "all", "path": prefix+"data/2_Sec02Gr2Sc2Activ_CJ_02-12-19-39_like_by_control_data_all_videos.csv"},
     {"run": 2, "user": "active", "videos": "liked", "path": prefix+"data/2_Sec02Gr2Sc2Activ_CJ_02-12-19-39_like_by_control_data_liked_videos.csv"},

     {"run": 3, "user": "control", "videos": "all", "path": prefix+"data/-1_Sec02Gr2Sc2Cntrl_CJ_02-13-13-31_like_by_hashtag_data_all_videos.csv"},
     {"run": 3, "user": "control", "videos": "liked", "path": prefix+"data/-1_Sec02Gr2Sc2Cntrl_CJ_02-13-13-31_like_by_hashtag_data_liked_videos.csv"},
     {"run": 3, "user": "active", "videos": "all", "path": prefix+"data/2_Sec02Gr2Sc2Activ_CJ_02-13-13-30_like_by_control_data_all_videos.csv"},
     {"run": 3, "user": "active", "videos": "liked", "path": prefix+"data/2_Sec02Gr2Sc2Activ_CJ_02-13-13-30_like_by_control_data_liked_videos.csv"},

     {"run": 4, "user": "control", "videos": "all", "path": prefix+"data/-1_Sec02Gr2Sc2Cntrl_CJ_02-13-13-37_like_by_hashtag_data_all_videos.csv"},
     {"run": 4, "user": "control", "videos": "liked", "path": prefix+"data/-1_Sec02Gr2Sc2Cntrl_CJ_02-13-13-37_like_by_hashtag_data_liked_videos.csv"},
     {"run": 4, "user": "active", "videos": "all", "path": prefix+"data/2_Sec02Gr2Sc2Activ_CJ_02-13-13-36_like_by_control_data_all_videos.csv"},
     {"run": 4, "user": "active", "videos": "liked", "path": prefix+"data/2_Sec02Gr2Sc2Activ_CJ_02-13-13-36_like_by_control_data_liked_videos.csv"},

     {"run": 5, "user": "control", "videos": "all", "path": prefix+"data/-1_Sec02Gr2Sc2Cntrl_CJ_02-13-13-53_like_by_hashtag_data_all_videos.csv"},
     {"run": 5, "user": "control", "videos": "liked", "path": prefix+"data/-1_Sec02Gr2Sc2Cntrl_CJ_02-13-13-53_like_by_hashtag_data_liked_videos.csv"},
     {"run": 5, "user": "active", "videos": "all", "path": prefix+"data/2_Sec02Gr2Sc2Activ_CJ_02-13-13-52_like_by_control_data_all_videos.csv"},
     {"run": 5, "user": "active", "videos": "liked", "path": prefix+"data/2_Sec02Gr2Sc2Activ_CJ_02-13-13-52_like_by_control_data_liked_videos.csv"}
    ])
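
Because every path in the table depends on `prefix`, it can help to confirm that the files exist before reading them. The following is a minimal sketch, not part of the original notebook: `missing_files` and `demo_files` are hypothetical names, and the demo table uses placeholder paths rather than the real data files.

```python
import os
import pandas as pd

def missing_files(files_df):
    """Return the rows of files_df whose "path" does not exist on disk."""
    exists = files_df["path"].map(os.path.exists)
    return files_df[~exists]

# Hypothetical two-row table for illustration; in the notebook you would pass sc2_files.
demo_files = pd.DataFrame([
    {"run": 1, "user": "control", "videos": "all", "path": os.getcwd()},
    {"run": 1, "user": "active", "videos": "all",
     "path": os.path.join(os.getcwd(), "no_such_file_for_demo_12345.csv")},
])
missing = missing_files(demo_files)
print(missing[["run", "user", "videos"]])
```

An empty result from `missing_files(sc2_files)` would suggest that `prefix` points at the right folder.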

## 2. Data Manipulation <a class="anchor" id="two"></a>
For each scenario:
1. Read the CSVs and name each DataFrame after its run number, user type, and the type of videos it stores.
2. Combine all runs into a single DataFrame per scenario for each combination of user type and video type (control_all, active_all, active_liked).

In [8]:
def read_csv_and_name_dfs(sc_files):
    sc = {}
    for row in range(sc_files.shape[0]):
        var_name = "r{}_{}_{}".format(sc_files.iloc[row]["run"], sc_files.iloc[row]["user"], sc_files.iloc[row]["videos"])
        sc[var_name] = pd.read_csv(sc_files.iloc[row]["path"])
        sc[var_name]["run"] = sc_files.iloc[row]["run"] #add new column for run number
    return sc

sc2 = read_csv_and_name_dfs(sc2_files)

In [9]:
def merge_runs(sc):
    to_merge_control_all = []
    to_merge_active_all = []
    to_merge_active_liked = []
    for run_num in range(1, 6): #CHANGE LATER TO 1 TO 21
        to_merge_control_all.append(sc["r{}_control_all".format(run_num)])
        to_merge_active_all.append(sc["r{}_active_all".format(run_num)])
        to_merge_active_liked.append(sc["r{}_active_liked".format(run_num)])

    control_all = pd.concat(to_merge_control_all)
    active_all = pd.concat(to_merge_active_all)
    active_liked = pd.concat(to_merge_active_liked)
    return (control_all, active_all, active_liked)

# call merge_runs once and unpack the three combined dataframes
sc2_control_all, sc2_active_all, sc2_active_liked = merge_runs(sc2)
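
The run range in `merge_runs` is hard-coded, so it has to be edited whenever more runs are added. One possible alternative, sketched here under the assumption that every key follows the `r<run>_<user>_<videos>` pattern used above, is to group the dataframes by the part of the key after the run label. `merge_runs_by_group` and `toy_sc` are hypothetical names introduced for illustration only.

```python
import pandas as pd

def merge_runs_by_group(sc):
    """Concatenate the per-run dataframes in sc, grouped by "<user>_<videos>"."""
    groups = {}
    for name, df in sc.items():
        # "r1_control_all" -> run label "r1", group "control_all"
        _, group = name.split("_", 1)
        groups.setdefault(group, []).append(df)
    return {group: pd.concat(dfs) for group, dfs in groups.items()}

# Toy input with hypothetical values; in the notebook you would pass sc2.
toy_sc = {
    "r1_control_all": pd.DataFrame({"author": ["a"], "run": 1}),
    "r2_control_all": pd.DataFrame({"author": ["b"], "run": 2}),
    "r1_active_liked": pd.DataFrame({"author": ["c"], "run": 1}),
}
merged = merge_runs_by_group(toy_sc)
print(sorted(merged))
print(merged["control_all"]["run"].tolist())
```

Unlike `merge_runs`, this sketch returns a dictionary and would also produce a `control_liked` entry when such keys are present.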

NOTE: From now on, to access the data from a specific CSV file, use sc**2**["r**1\_control\_all**"],
where the bolded text should be replaced with
* the scenario number (1, 2, 3, 4, 5, 6)
* the run number (1, 2, 3, ..., 20)
* whether the user is control or active (control, active)
* whether the data holds all videos or only liked videos (all, liked) <br>
For example, `sc2["r1_control_all"]` gives (as a DataFrame) all videos that scenario 2's control user saw **during the first run only**.


NOTE: To access the data from all runs for a given scenario, user, and video type, use sc**2\_control\_all**,
where the bolded text should be replaced with
* the scenario number (1, 2, 3, 4, 5, 6)
* one of control_all, active_all, or active_liked <br>
For example, `sc2_active_all` gives (as a DataFrame) all videos that scenario 2's active user saw **across all runs** (20 once every run is loaded; the `merge_runs` loop above currently covers runs 1 to 5).

You can also select a single run from a combined DataFrame with `scenario_user_video[scenario_user_video.run == run_number]`.

For example, `sc2_active_all[sc2_active_all.run == 2]` returns the same rows as `sc2["r2_active_all"]` <br>
(all the videos that scenario 2's active user saw during the 2nd run only).
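
The relationship between the two access patterns can be checked on toy data. This is a minimal sketch with made-up values; `toy` and `toy_active_all` are hypothetical stand-ins for `sc2` and `sc2_active_all`.

```python
import pandas as pd

# Toy stand-ins for two per-run dataframes (hypothetical values, not real data).
r1 = pd.DataFrame({"author": ["a", "b"], "run": 1})
r2 = pd.DataFrame({"author": ["c"], "run": 2})
toy = {"r1_active_all": r1, "r2_active_all": r2}
toy_active_all = pd.concat([r1, r2])

# Filtering the combined dataframe by run recovers the per-run rows.
subset = toy_active_all[toy_active_all.run == 2]
print(subset.equals(toy["r2_active_all"]))
```

Note that `pd.concat` keeps each run's original row index, so index labels repeat across runs in the combined dataframe.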

## 3. Analyzing Hashtags <a class="anchor" id="three"></a>

### 3.1 Data Cleaning

In [10]:
hash_to_ignore = ["fyp", "viral", "foryou", "foryoupage", "tiktok", "fy", "fypage", "fypchallenge"]
def clean_hashtags(df):
    """Return (unique hashtags not in hash_to_ignore, number of posts without hashtags)."""
    noNA_hash = []
    NAcount = 0
    for value in df.hashtag: #for each post
        if isinstance(value, str): #skip NaN entries
            noNA_hash.append([tag.strip() for tag in value.split(",")])
        else:
            NAcount += 1
    full_list = list(set(a for b in noNA_hash for a in b)) #set() drops duplicate hashtags
    return ([tag for tag in full_list if tag not in hash_to_ignore], NAcount)

### 3.2 Frequency Table

In [11]:
freq = pd.Series(clean_hashtags(sc2["r1_control_all"])[0]).value_counts()
print(freq[freq>1]) #always empty here: clean_hashtags() de-duplicates via set(), so every count is 1

Series([], Name: count, dtype: int64)
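
Since `clean_hashtags` returns each hashtag only once, the table above cannot show repetition. To count how often a hashtag appears across posts, the duplicates have to be kept. The following is a minimal sketch, assuming the `hashtag` column holds comma-separated strings as in `clean_hashtags`; `hashtag_counts` and `toy_posts` are hypothetical names, and the toy values are made up.

```python
import pandas as pd

def hashtag_counts(df, ignore=()):
    """Count how many times each hashtag occurs across posts (duplicates kept)."""
    tags = (
        df["hashtag"]
        .dropna()          # posts without hashtags are skipped
        .str.split(",")    # "a, b" -> ["a", " b"]
        .explode()         # one row per hashtag
        .str.strip()       # remove surrounding whitespace
    )
    tags = tags[~tags.isin(list(ignore))]
    return tags.value_counts()

# Toy posts with hypothetical hashtags; in the notebook you would pass e.g. sc2["r1_control_all"].
toy_posts = pd.DataFrame({"hashtag": ["baking, food", "food, fyp", None, "food"]})
counts = hashtag_counts(toy_posts, ignore=["fyp"])
print(counts[counts > 1])
```

In the notebook, `hash_to_ignore` could be passed as the `ignore` argument.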


### 3.3 Jaccard Index

In [16]:
def jaccard_similarity(set1, set2):
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))
    return intersection / union

# an example for how to run this:
set_a = set(["cookies", "baking", "food", "recipe"])
set_b = set(["oven", "baking", "cook", "bread"])
similarity = jaccard_similarity(set_a, set_b)
print("Jaccard Similarity:", similarity)


# a) comparison 1: control vs active feeds (analyzing possible feed divergence)
#scen1_control_all = set(sc1["r1_control_all"])
#scen1_active_all = set()

# b) comparison 2: predefined hashtags to like vs actually liked posts (to compare similarities between other hashtags that were not predefined but still appeared for similar posts)
#scen1_predefined_hash = set()
#scen1_active_like = set()

Jaccard Similarity: 0.14285714285714285
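
Comparison (a) could be set up along the following lines. This is a sketch on toy data rather than the notebook's method: `hashtag_set`, `jaccard`, `toy_control`, and `toy_active` are hypothetical names, and returning 0.0 when both sets are empty is a convention chosen here to avoid dividing by zero.

```python
import pandas as pd

def hashtag_set(df):
    """Collect the unique, stripped hashtags in df["hashtag"], skipping NaN rows."""
    tags = df["hashtag"].dropna().str.split(",").explode().str.strip()
    return set(tags)

def jaccard(set1, set2):
    """Jaccard index |A & B| / |A | B|, with 0.0 for two empty sets (an assumption)."""
    union = set1 | set2
    if not union:
        return 0.0
    return len(set1 & set2) / len(union)

# Toy feeds with hypothetical hashtags; in the notebook these would be per-run dataframes.
toy_control = pd.DataFrame({"hashtag": ["cookies, baking", "food"]})
toy_active = pd.DataFrame({"hashtag": ["baking, bread", None]})
score = jaccard(hashtag_set(toy_control), hashtag_set(toy_active))
print(score)
```

Here the two toy feeds share one hashtag out of four distinct ones.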


In [13]:
## ANALYZING MUSIC ----------------------------------------------------------------------------------
mus = sc2["r1_control_all"].music.value_counts()
print(mus)




## ANALYZING AUTHOR ---------------------------------------------------------------------------------
aut = sc2["r1_control_all"].author.value_counts()
print(aut)

music
original sound - Gavin.kernstinee                              2
AIN'T GONNA ANSWER - NLE Choppa & Lil Wayne                    2
I Wouldnt Mind - ♱                                             1
original sound - Alex                                          1
som original - rinx                                            1
                                                              ..
original sound - fr0sty_rick                                   1
Monkeys Spinning Monkeys - Kevin MacLeod & Kevin The Monkey    1
Oi - My Soul Gone                                              1
original sound - Sam Middleton                                 1
original sound - squishiesophie                                1
Name: count, Length: 80, dtype: int64
author
mirandahshaul         2
didiios               1
dangmattsmith         1
vaynessvalery         1
wolf.spectrum         1
                     ..
leci.bby              1
madelineandstephen    1
radicalraphael        1
itz_justlola      