# Karachi Load Shedding Data

Last accessed November 29, 2020

Abstract

This program seeks to train a model that will perform better than a dummy model for predicting when the entity being blamed the most for the load-shedding issue in Karachi, Pakistan will be blamed based on tweets also blaming other entities.  We will use model selection techniques to determine if we have a model that mets this criteria or, in the event tnone of the models meet the criteria, we will simply determine the best available model.



Gathering data

First, we import the modules we will need to perform the searches. We need to gather the API keys for the Twitter module and initialize some variables for later use.

In [1]:
# Modules we will need
import config
import datetime as DT
import twitter
import pandas as pd
import numpy as np

# Pulling credentials for python-twitter API
api = twitter.Api(consumer_key = config.api_key, consumer_secret = config.api_key_secret, access_token_key = config.access_token, access_token_secret = config.access_secret)

# Create variables that use datetime module to find today and one week ago
today = DT.date.today()
week_ago = today - DT.timedelta(days = 7)

# The search criteria we will need for all raw_query GetSearch twitter pulls
static_search = 'q=load shedding karachi'

# Create a list of all blamed entities in our program
blamed_list = ['Karachi Electric', 'Imran Khan', 'Asad Umar', 'NEPRA', 'Sui Gas', 'Naeem Rehman', 'Omar Ayub Khan', 'Tehreeki Insaaf']

In this cell, we set up two functions for later use.

In [2]:
# Set up the functions that we will need.

def name_setter(search):
    '''Returns set of twitter usernames who tweeted specified search criteria using GetSearch raw_query.  Contains no duplicates.'''
    name_set = {tweet.user.screen_name for tweet in search}
    return name_set


def name_mbr_test(name_set_list, names):
    '''Returns dictionary of membership tests between two iterables.'''
    name_dict = {}
    for name in names:
        if name in name_set_list:
            name_dict[name] = True           
        else:
            name_dict[name] = False
    return name_dict 

This cell is where the actual searches take place. We take the API key authority that we set up earlier and perform a GetSearch Twitter API search that includes the static search variable we set up to search for tweets about Karachi load shedding. The extra search criteria in each search that comes immediately after the static search is based on a manual scan of how tweeters seemed to be blaming the entities.

Next, is the logic for the time frame we are trying to pull from. Here we are pulling all tweets from today and going back one week. This is the maximum available historical pull we can do with the Twitter API. Next, we set the total tweet count we are pulling to 100, the maximum allowed for the Twitter API.

The next step that I want to take in this code is to create a more automated way to find the individualized search criteria

In [3]:
# Search criteria for Karachi load shedding.

ke_search = api.GetSearch(raw_query = f'{static_search} #KE until%3A{today} since%3A{week_ago}&count=100')
KElectricPk_search = api.GetSearch(raw_query = f'{static_search} %40KElectricPk until%3A{today} since%3A{week_ago}&count=100')
imran_khan_search = api.GetSearch(raw_query = f'{static_search} %40ImranKhanPTI until%3A{today} since%3A{week_ago}&count=100')
asad_umar_search = api.GetSearch(raw_query = f'{static_search} %40Asad_Umar until%3A{today} since%3A{week_ago}&count=100')
nepra_search = api.GetSearch(raw_query = f'{static_search} #NEPRA until%3A{today} since%3A{week_ago}&count=100')
sui_search = api.GetSearch(raw_query = f'{static_search} sui until%3A{today} since%3A{week_ago}&count=100')
naeem_rehman_search = api.GetSearch(raw_query = f'{static_search} %40NaeemRehmanEngr until%3A{today} since%3A{week_ago}&count=100')
omar_ayub_khan_search = api.GetSearch(raw_query = f'{static_search} %40OmarAyubKhan until%3A{today} since%3A{week_ago}&count=100')
PTI_government_search = api.GetSearch(raw_query = f'{static_search} #PTI_government until%3A{today} since%3A{week_ago}&count=100')

In this cell, we apply the name_setter function that we created to all of the searches. The name_setter function finds all of the tweeters in the search and puts them into a set, which automatically removes duplicate names from the set. Removing duplicate names allows us to see individual "blames" against any entity that we can use to count as blames. It wouldn't make sense to gather multiple tweets by the same tweeter if what we are looking for is a case of one person blaming an entity.

In [4]:
# Take twitter user names for all of the searches
ke_set = name_setter(ke_search)
KElectricPk_set = name_setter(KElectricPk_search)
karachi_electric_set = ke_set.union(KElectricPk_set)
imran_khan_set = name_setter(imran_khan_search)
asad_umar_set = name_setter(asad_umar_search)
nepra_set = name_setter(nepra_search)
sui_set = name_setter(sui_search)
naeem_rehman_set = name_setter(naeem_rehman_search)
omar_ayub_khan_set = name_setter(omar_ayub_khan_search)
PTI_set = name_setter(PTI_government_search)

In this cell, we set up a list of the sets we created and conduct a union to merge all of the names from all of the sets together to perform more analysis. In this case, we also want to remove the duplicate names when we merge all of the sets together, so we merge the names into a set.

In [5]:
# Initiate a list of sets
name_set_list = [karachi_electric_set, imran_khan_set, asad_umar_set, nepra_set, sui_set, naeem_rehman_set, omar_ayub_khan_set, PTI_set]

# Combine all sets from name_set_list and leave into one set
blamer_set = set.union(*name_set_list)

In this cell, we create a dictionary out of the blamed list we created and the length of the name_set_list, which represents the total number of blames per entity.

In [6]:
# Find the length of the name_set_list
name_set_lengths = name_set_length_lister(name_set_list)

# Create a dictionary of the blamed entities and the number of times each entity is blamed
blamed_dict = dict(zip(blamed_list, name_set_lengths))
print(blamed_dict)

{'Karachi Electric': 15, 'Imran Khan': 3, 'Asad Umar': 0, 'NEPRA': 15, 'Sui Gas': 1, 'Naeem Rehman': 0, 'Omar Ayub Khan': 2, 'Tehreeki Insaaf': 15}


In this cell, we take the blamer_set and make it a list to apply to our membership tests. We turn all the sets into lists to avoid complications with looping through the sets when receiving duplicates. The name_mbr_test function loops throug the blame_list to determine if a name from the blame_list is in the the specified list and returns a single dictionary with names and True/False values. Finally, we put all of the membership test dictionaries into a list.

In [7]:
# Turn the set into a list
blame_list = list(blamer_set)


# Membership tests for all name sets
karachi_electric_blame_test = name_mbr_test(list(karachi_electric_set), blame_list)
imran_khan_blame_test = name_mbr_test(list(imran_khan_set), blame_list)
asad_umar_blame_test = name_mbr_test(list(asad_umar_set), blame_list)
nepra_blame_test = name_mbr_test(list(nepra_set), blame_list)
sui_blame_test = name_mbr_test(list(sui_set), blame_list)
naeem_rehman_blame_test = name_mbr_test(list(naeem_rehman_set), blame_list)
omar_ayub_khan_blame_test = name_mbr_test(list(omar_ayub_khan_set), blame_list)
PTI_blame_test = name_mbr_test(list(PTI_set), blame_list)

# Make sure we are getting a dictionary back
print(type(karachi_electric_blame_test))

# Make a list of all blame tests
blame_test_list = [karachi_electric_blame_test, imran_khan_blame_test, asad_umar_blame_test,\
                   nepra_blame_test, sui_blame_test, naeem_rehman_blame_test, omar_ayub_khan_blame_test, PTI_blame_test]


<class 'dict'>


In this cell, we create a pandas DataFrame out of the membership test dictionaries, which shows how much each individual is blaming different entities. We mask the names with generic terms to protect the identities of those who have blamed the entities. The theory is that we can use these occurences to determine the likelihood of a tweeter blaming one entity if they blame another and potentially predicting if an entity will be blamed based on the other entities being blamed.

In [9]:
# Create a dataframe out of the membership tests
karachi_ls_df = pd.DataFrame(blame_test_list, index = blamed_list)

# Set column names to Tweeter plus the column index
karachi_ls_df.columns = ['Tweeter' + str(x) for x in range(0, len(karachi_ls_df.columns))]
karachi_ls_df

Unnamed: 0,Tweeter0,Tweeter1,Tweeter2,Tweeter3,Tweeter4,Tweeter5,Tweeter6,Tweeter7,Tweeter8,Tweeter9,Tweeter10,Tweeter11,Tweeter12,Tweeter13,Tweeter14,Tweeter15
Karachi Electric,False,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True
Imran Khan,True,False,False,False,False,False,False,True,False,False,False,False,False,True,False,False
Asad Umar,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
NEPRA,False,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True
Sui Gas,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False
Naeem Rehman,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
Omar Ayub Khan,False,True,False,False,False,False,True,False,False,False,False,False,False,False,False,False
Tehreeki Insaaf,False,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True


In this cell, we combine the previous searches in csv files and add them to the new search to create one dataframe.

In [39]:
# Take previous data stored in csv
karachi_load_shedding = pd.DataFrame(pd.read_csv('karachi_ls_tweeters'))
karachi_load_shedding = karachi_load_shedding.set_index('Unnamed: 0')
karachi_load_shedding.index.name = ''

second_kls_search = pd.DataFrame(pd.read_csv('karachi_ls.csv'))
second_kls_search = second_kls_search.set_index('Blamed')
second_kls_search.index.name = ''

karachi_load_shedding2 = pd.concat([karachi_load_shedding, second_kls_search], axis = 1)
karachi_load_shedding2 = karachi_load_shedding2.loc[:,~karachi_load_shedding2.columns.duplicated()]

# Combine the dataframes into one
frames = [karachi_ls_df, karachi_load_shedding2]
kls_df = pd.concat(frames, axis = 1)
kls_df.columns = ['Tweeter' + str(x) for x in range(0, len(kls_df.columns))]

# Turn True/False values into integers
kls_df = kls_df * 1

# Change index name
kls_df.index.name = 'Blamed'

kls_df
#karachi_load_shedding2
#second_kls_search

Unnamed: 0_level_0,Tweeter0,Tweeter1,Tweeter2,Tweeter3,Tweeter4,Tweeter5,Tweeter6,Tweeter7,Tweeter8,Tweeter9,...,Tweeter65,Tweeter66,Tweeter67,Tweeter68,Tweeter69,Tweeter70,Tweeter71,Tweeter72,Tweeter73,Tweeter74
Blamed,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Karachi Electric,0,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
Imran Khan,1,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
Asad Umar,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
NEPRA,0,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
Sui Gas,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Naeem Rehman,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Omar Ayub Khan,0,1,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Tehreeki Insaaf,0,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
