# Setting up the functions for data cleaning

In [None]:
import pandas as pd
import time

In [None]:
# Set up the top 100 list for comparison:

top100 = ["realDonaldTrump", "WhiteHouse", "TeamTrump", "GOPChairwoman", "DanScavino", "Jim_Jordan", "GOP", "Scavino45", "DonaldJTrumpJr",
          "IvankaTrump", "GreggJarrett", "EricTrump", "TomFitton", "paulsperry", "RepMarkMeadows", "marklevinshow", "TrumpWarRoom",
          "LindseyGrahamSC", "charliekirk11", "dbongino", "SenateGOP", "GOPLeader", "JudicialWatch", "senatemajldr", "foxandfriends",
          "seanhannity", "VP", "SteveScalise", "MZHemingway", "FLOTUS", "LouDobbs", "DailyCaller", "BuckSexton", "RepAndyBiggsAZ", "RepLeeZeldin",
          "KimStrassel", "Mike_Pence", "MariaBartiromo", "PressSec", "SaraCarterDC", "RepDougCollins", "RepMattGaetz", "NHC_Atlantic",
          "JohnWHuber", "MarshaBlackburn", "AndrewCMcCarthy", "RandPaul", "IngrahamAngle", "TVNewsHQ", "jsolomonReports",
          "CLewandowski_", "parscale", "GOPoversight", "ByronYork", "FoxNews", "GeraldoRivera", "DevinNunes", "BreitbartNews", "SenTomCotton",
          "dcexaminer", "thebradfordfile", "RNCResearch", "seanmdav", "JackPosobiec", "CDCgov", "marcorubio", "Lrihendry",
          "SenTedCruz", "thehill", "PollWatch2020", "OANN", "DavidJHarrisJr", "JennaEllisEsq", "NRA", "Varneyco", "kayleighmcenany", "RudyGiuliani",
          "hughhewitt", "JesseBWatters", "HouseGOP", "RealJamesWoods", "fema", "KatrinaPierson", "SenJohnBarrasso", "ericbolling", "HawleyMO",
          "thejtlewis", "TimMurtaugh", "RichardGrenell", "tedcruz", "bennyjohnson", "TheRightMelissa", "KellyannePolls", "SenRonJohnson",
          "SenThomTillis", "EliseStefanik", "brithume", "ChuckGrassley"]

The list above contains only 98 handles rather than 100, for two reasons:

* One account is private
* Mike Pence's account shows up twice. His handle was `mike_pence` until 2016 and has been `Mike_Pence` since then. Twitter handles are not case-sensitive, so both spellings refer to the same account.
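
Because handles are not case-sensitive, duplicates like the one above can be detected by comparing handles in a normalized form. The following is a minimal sketch; `normalize_handle` is a hypothetical helper that is not part of this notebook, and the short `handles` list is made up for illustration.

```python
def normalize_handle(handle):
    """Hypothetical helper: canonical form of a handle for case-insensitive comparison."""
    # Remove surrounding whitespace and a leading "@", then ignore case.
    return handle.strip().lstrip("@").casefold()


# Illustrative input only: two spellings of the same account plus one other handle.
handles = ["mike_pence", "Mike_Pence", "WhiteHouse"]

# A set keeps one entry per normalized handle, so duplicates collapse.
unique_handles = {normalize_handle(h) for h in handles}
print(len(unique_handles))  # 2
```

Note that the functions below compare handles exactly as written, so this normalization would have to be applied to both sides of a comparison to take effect.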

The following function loads the friends file of one account and prepares the data so that it can be matched against followers later:

In [None]:
def prepare_friends_data(filename):
    """
    :param filename: Twitter-handle used in the filenames
    :returns: DataFrame with clean columns
    """
    data = pd.read_csv(f"friends/{filename}_friends.csv")

    # Insert a first column that holds the account name on every row:
    data.insert(0, "Account", filename)

    # Rename columns:
    data.columns = ["Account", "Follows"]
    return data
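
To illustrate the reshaping step, here is a small sketch that applies the same two operations to in-memory data instead of a file. It assumes that each friends file has a single column of handles, which the rename to exactly two column names suggests; the header name `screen_name` and the sample rows are made up.

```python
import io

import pandas as pd

# Made-up file contents; the real header name in the friends files is an assumption.
sample_csv = io.StringIO("screen_name\nGOP\nnasa\n")
data = pd.read_csv(sample_csv)

# Same steps as in prepare_friends_data(): prepend the account name, then rename.
data.insert(0, "Account", "WhiteHouse")
data.columns = ["Account", "Follows"]
print(data)
```

Every row now carries the account name next to one of the handles it follows.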

The following function checks whether each followed account in the data frame is also part of the top 100 list and drops all others. The output is a data frame with the following structure:

```
+--------------+----------+
| Account      | Follows  |
+--------------+----------+
| Account-Name | Friend 1 |
| Account-Name | Friend 2 |
| Account-Name | Friend 3 |
| ...          | ...      |
+--------------+----------+
```

In [None]:
def clean_friends_list(data):
    """
    :param data: Clean data set as obtained from function prepare_friends_data()
    :returns: DataFrame which only contains Friends accounts that appear in the top100 list
    """
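    # Walk through the rows and drop every friend that is not in top100:
    for index, row in data.iterrows():
        if row["Follows"] not in top100:
            data.drop(index, inplace=True)
    # Renumber the remaining rows starting from 0:
    data.reset_index(inplace=True, drop=True)
    return data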
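
Dropping rows one by one inside a Python loop tends to be slow on large friends lists. A possible alternative, sketched here as an assumption rather than as the method used in this notebook, filters all rows at once with `Series.isin`. The name `clean_friends_list_vectorized` and the sample data are hypothetical. Unlike the function above, this version returns a new DataFrame and leaves its input unchanged.

```python
import pandas as pd


def clean_friends_list_vectorized(data, allowed):
    """Hypothetical alternative: keep only rows whose "Follows" value is in allowed."""
    # Boolean mask that is True for every row to keep (exact, case-sensitive match).
    mask = data["Follows"].isin(allowed)
    # Select the matching rows and renumber them starting from 0.
    return data[mask].reset_index(drop=True)


# Made-up sample in the Account/Follows layout shown above.
sample = pd.DataFrame({
    "Account": ["WhiteHouse", "WhiteHouse", "WhiteHouse"],
    "Follows": ["GOP", "nasa", "VP"],
})
result = clean_friends_list_vectorized(sample, ["GOP", "VP"])
print(result)
```

Called as `clean_friends_list_vectorized(test, top100)`, it should keep the same rows as the loop-based function.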

### Testing the functions:

In [None]:
test = prepare_friends_data("WhiteHouse")
print(test)

In [None]:
test2 = clean_friends_list(test)
print(test2)

# Cleaning the data in a loop

The following function cleans the CSV file in the *friends* folder for every handle in the given list and creates a "friends" list for each account. The combined results for all accounts are written to a file called `followers_complete.csv`.

**Info:** The run to clean all data sets took 2011.23 seconds (approx. 33.5 minutes).

In [None]:
def hausputz(list_to_clean):
    """
    :param list_to_clean: List of Twitter handles whose CSV files should be cleaned.
    :returns: DataFrame with the cleaned data of all accounts; the same data is written to followers_complete.csv.
    """
    cleaned_frames = []
    for account in list_to_clean:
        cleaned_frames.append(clean_friends_list(prepare_friends_data(account)))
        print(f"Account {account} analyzed.")
    # Concatenate once at the end; this avoids DataFrame.append(), which pandas 2.0 removed.
    if cleaned_frames:
        complete_data = pd.concat(cleaned_frames, ignore_index=True)
    else:
        complete_data = pd.DataFrame(columns=["Account", "Follows"])
    complete_data.to_csv("followers_complete.csv", index=False)
    return complete_data
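
`pd.read_csv` raises a `FileNotFoundError` when a file does not exist, so a single missing friends file (for example for a private account) would stop the whole run. One way to guard against this, shown as a sketch only, is to check beforehand which handles have a file. `split_by_file_availability` is a hypothetical helper; the temporary folder and the handle `SomePrivateAccount` exist only for this demonstration.

```python
import tempfile
from pathlib import Path


def split_by_file_availability(handles, folder="friends"):
    """Hypothetical helper: separate handles that have a friends CSV from those that do not."""
    available, missing = [], []
    for handle in handles:
        # Same naming scheme as in prepare_friends_data().
        path = Path(folder) / f"{handle}_friends.csv"
        if path.is_file():
            available.append(handle)
        else:
            missing.append(handle)
    return available, missing


# Demonstration with a throwaway folder that contains a single made-up file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "GOP_friends.csv").write_text("screen_name\nVP\n")
    available, missing = split_by_file_availability(
        ["GOP", "SomePrivateAccount"], folder=tmp
    )

print(available)  # ['GOP']
print(missing)    # ['SomePrivateAccount']
```

The `available` list could then be passed to `hausputz()` instead of the full list.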

In [None]:
# Determine start time:
start_time = time.time()
print("Start time: " + str(start_time))

all_friends = hausputz(top100)

# Determine end time and time needed for execution:
end_time = time.time()
print("Finishing time: " + str(end_time))
print("Time for execution (sec.): " + str(end_time - start_time))
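
The raw values printed above are seconds since the epoch, which are hard to read. As a small sketch, the elapsed time could instead be measured with `time.perf_counter()` and printed in minutes and seconds. `format_duration` is a hypothetical helper that is not part of this notebook.

```python
import time


def format_duration(seconds):
    """Hypothetical helper: render a duration given in seconds as 'M min S s'."""
    minutes, rest = divmod(int(round(seconds)), 60)
    return f"{minutes} min {rest} s"


start = time.perf_counter()
# ... the long-running call would go here ...
elapsed = time.perf_counter() - start
print("Time for execution: " + format_duration(elapsed))

# The duration reported earlier in this notebook, in readable form:
print(format_duration(2011.23))  # 33 min 31 s
```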

In [None]:
all_friends