# SA Instagram Profile Data Extraction

## I. Introduction
This notebook documents how we extract data from social media platforms such as Instagram and Twitter for academic research. Our aim is to study how NCAA rule changes on the contracting of Name, Image, and Likeness (NIL) may affect the academic performance of student athletes (SAs). The dataset obtained through this process will support research on that question.

## II. Confidentiality
The data obtained from the extraction process is exclusively intended for academic purposes and is **not** utilized for any commercial activities. To uphold privacy and data protection, any identifiable information such as links, names, or other pertinent details related to student athletes has been anonymized or redacted.

## III. Instagram Profile Extraction

### a. Design a Function

In [7]:
# Import packages.
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

In [8]:
# Set options for ChromeDriver.
chrome_options = Options()
chrome_options.add_argument("--disable-extensions")
chrome_options.add_argument('--headless') 
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('start-maximized')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

In [9]:
# Define a function that scrapes profile counts from Instagram profile pages.
def insta_function(url: str):
    """Return the raw text tokens that hold the posts, followers, and following counts of an Instagram profile."""
    driver = webdriver.Chrome(executable_path = '/usr/local/bin/chromedriver', chrome_options = chrome_options)
    driver.get(url)
    time.sleep(2) # 2 seconds was the shortest wait that worked in testing.
    page_source = driver.page_source
    soup = BeautifulSoup(page_source, "html.parser")
    scripts = soup.select('div', class_ = '_aacl _aaco _aacu _aacy _aad6 _aadb _aade' )
    try: 
        full_scripts = scripts[1].get_text().strip().split()
        stories = full_scripts[0]
        followers = full_scripts[1]
        following = full_scripts[2]
    except IndexError: 
        stories = 'Error'
        followers = 'Error'
        following = 'Error'
    driver.quit()
    return [stories, followers, following]

### b. Data Pre-Processing

In [10]:
# Test the function on an openable and a non-openable URL, and time each call.
st = time.time()
print(insta_function('https://www.instagram.com/micah_burno'))
et = time.time() 
elapsed_time = et - st
print('Execution time of Openable Website:', elapsed_time, 'seconds')

st = time.time()
print(insta_function('https://www.instagram.com/malachidesousa_12345')) # Not an openable link.
et= time.time()
elapsed_time = et - st
print('Execution time of Non-Openable Website:', elapsed_time, 'seconds')

['micah_burnoFollowMessageOptions102', 'posts2,802', 'followers900']
Execution time of Openable Website: 5.05092716217041 seconds
['Sorry,', 'this', 'page']
Execution time of Non-Openable Website: 4.707247018814087 seconds
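
The sample output suggests that the page text lists each count before its label, so after splitting on whitespace every label ends up glued to the next number: the first token ends with the post count, the second is `posts` plus the follower count, and the third is `followers` plus the following count. Below is a minimal sketch of relabelling the tokens, assuming this layout holds for every profile; `label_tokens` is a hypothetical helper and not part of the notebook.

```python
def label_tokens(tokens):
    """Relabel the three raw tokens returned by insta_function (hypothetical helper)."""
    # A missing page yields the first words of the "Sorry, this page..." notice,
    # and a failed parse yields 'Error' placeholders.
    if tokens == ['Sorry,', 'this', 'page'] or 'Error' in tokens:
        return None
    first, second, third = tokens
    return {
        'posts_raw': first,                              # username + buttons + post count
        'followers': second.replace('posts', '', 1),     # 'posts2,802' -> '2,802'
        'following': third.replace('followers', '', 1),  # 'followers900' -> '900'
    }

print(label_tokens(['micah_burnoFollowMessageOptions102', 'posts2,802', 'followers900']))
print(label_tokens(['Sorry,', 'this', 'page']))
```

Returning `None` for unusable pages keeps them easy to filter out before any numeric cleaning.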


In [12]:
# Load Each file.
data1 = pd.read_csv('athlete_socials.csv')
print(data1.info())
data2 = pd.read_csv('athlete_socials_2.csv')
print(data2.info())
data3 = pd.read_csv('athlete_socials_3.csv')
print(data3.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6600 entries, 0 to 6599
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   athlete_id   6600 non-null   object 
 1   name         6600 non-null   object 
 2   team         6600 non-null   object 
 3   verified     31 non-null     object 
 4   private      1 non-null      float64
 5   profile_url  6600 non-null   object 
dtypes: float64(1), object(5)
memory usage: 309.5+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2844 entries, 0 to 2843
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   athlete_id   2844 non-null   object
 1   name         2844 non-null   object
 2   team         2844 non-null   object
 3   profile_url  2844 non-null   object
dtypes: object(4)
memory usage: 89.0+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3561 entries, 0 to 3560
Data columns (total 4 colum

In [13]:
# Combine the three files into one complete dataframe.
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
df = pd.concat([data1, data2, data3])
print(df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13005 entries, 0 to 3560
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   athlete_id   13005 non-null  object 
 1   name         13005 non-null  object 
 2   team         13005 non-null  object 
 3   verified     31 non-null     object 
 4   private      1 non-null      float64
 5   profile_url  13005 non-null  object 
dtypes: float64(1), object(5)
memory usage: 711.2+ KB
None
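
The combined frame keeps each file's own row labels (the index runs from 0 to 3560 across 13,005 entries), so labels repeat. The sketch below uses toy frames with made-up values to show the difference `ignore_index=True` makes; it is an illustration, not a change to the notebook's data.

```python
import pandas as pd

# Toy stand-ins for the three CSV files; the values are made up for illustration.
part1 = pd.DataFrame({'athlete_id': ['a1', 'a2'], 'private': [1.0, None]})
part2 = pd.DataFrame({'athlete_id': ['b1']})
part3 = pd.DataFrame({'athlete_id': ['c1', 'c2']})

kept = pd.concat([part1, part2, part3])                           # labels 0, 1, 0, 0, 1
renumbered = pd.concat([part1, part2, part3], ignore_index=True)  # labels 0..4

print(kept.index.is_unique)        # False
print(renumbered.index.tolist())   # [0, 1, 2, 3, 4]
```

Positional slicing with `iloc` behaves the same either way, but label-based lookups with `loc` would return several rows for a repeated label.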


In [26]:
# Begin scraping from row 1,000, since the first 1,000 rows were used for a sample test.
second_part = df.iloc[1000:].copy()
print(second_part.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12005 entries, 1000 to 3560
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   athlete_id   12005 non-null  object 
 1   name         12005 non-null  object 
 2   team         12005 non-null  object 
 3   verified     0 non-null      object 
 4   private      0 non-null      float64
 5   profile_url  12005 non-null  object 
dtypes: float64(1), object(5)
memory usage: 656.5+ KB
None


In [27]:
# Flag whether each URL is an Instagram URL (so the function is applied only to those).
# Checking for both words can misclassify links that contain both "twitter" and "instagram".
# To keep things simple, we test whether "twitter" is in the link; if it is not, the URL is flagged True.
twitter = 'twitter'
instagram = 'instagram'

instagram_profile = []

for url in second_part.profile_url: 
    if twitter not in url: 
        instagram_profile.append(True)
    else: 
        instagram_profile.append(False)
print('Length of the list with both true and false:', len(instagram_profile))

# Add the list to the dataframe as a new column.
second_part['instagram_profile'] = instagram_profile
second_part.head()

Length of the list with both true and false: 12005


Unnamed: 0,athlete_id,name,team,verified,private,profile_url,instagram_profile
1000,4592432,Brandon Haddock,Utah,,,https://www.instagram.com/brandonhaddock1/,True
1001,4279626,Noah Morgan,E Michigan,,,https://www.instagram.com/datboinoah/?hl=en,True
1002,4705019,Avery Brittingham,SF Austin,,,https://twitter.com/averybrittingh2,False
1003,4705019,Avery Brittingham,SF Austin,,,https://www.instagram.com/averybrittingham11/?...,True
1004,4433613,Jackson Grant,Washington,,,https://twitter.com/jpgrant12,False
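
The URL classification loop can also be written as a single vectorized expression. The sketch below reproduces the notebook's substring rule and, as an alternative, decides from the host name alone; the URLs are hypothetical and the host-based check is an assumption about what would be more robust, not the method used here.

```python
from urllib.parse import urlparse

import pandas as pd

# Toy URLs standing in for the profile_url column (hypothetical handles).
urls = pd.Series([
    'https://www.instagram.com/example_user/',
    'https://twitter.com/example_user',
    'https://www.instagram.com/twitter_fan/',  # Instagram handle that contains "twitter"
])

# Same rule as the loop: a URL counts as Instagram when "twitter" does not appear in it.
by_substring = ~urls.str.contains('twitter', regex=False)

# Alternative: look only at the host name, so handles cannot confuse the check.
by_host = urls.map(lambda u: urlparse(u).netloc.endswith('instagram.com'))

print(by_substring.tolist())  # [True, False, False]
print(by_host.tolist())       # [True, False, True]
```

The third URL shows where the two rules disagree: the substring rule drops an Instagram profile whose handle happens to contain "twitter".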


In [28]:
# Select only the Instagram URLs to scrape with the function (6,205 rows).
only_instagram = second_part[second_part.instagram_profile == True].copy(deep = True)

print(only_instagram.head())
print('There are',len(only_instagram), 'instagram URLs in the dataset')

     athlete_id               name          team verified  private  \
1000    4592432    Brandon Haddock          Utah      NaN      NaN   
1001    4279626        Noah Morgan    E Michigan      NaN      NaN   
1003    4705019  Avery Brittingham     SF Austin      NaN      NaN   
1005    4433613      Jackson Grant    Washington      NaN      NaN   
1010    4398334      Caden Hoffman  South Dakota      NaN      NaN   

                                            profile_url  instagram_profile  
1000         https://www.instagram.com/brandonhaddock1/               True  
1001        https://www.instagram.com/datboinoah/?hl=en               True  
1003  https://www.instagram.com/averybrittingham11/?...               True  
1005        https://www.instagram.com/jpgrant.12/?hl=en               True  
1010     https://www.instagram.com/caden.hoffman/?hl=en               True  
There are 6205 instagram URLs in the dataset


### c. Data Extraction
Each URL takes up to 5 seconds to process, so the 6,205 URLs should take about 8.6 hours in total. We therefore break the URLs into segments of 2,000, each of which takes about 2.8 hours to complete.
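
The time estimate can be checked with a few lines of arithmetic. The constants come from the text above; the variable names are only illustrative.

```python
SECONDS_PER_URL = 5   # upper bound observed in the timing test
TOTAL_URLS = 6205
SEGMENT_SIZE = 2000

total_hours = TOTAL_URLS * SECONDS_PER_URL / 3600
segment_hours = SEGMENT_SIZE * SECONDS_PER_URL / 3600

print(f'All URLs: {total_hours:.1f} hours')       # All URLs: 8.6 hours
print(f'One segment: {segment_hours:.1f} hours')  # One segment: 2.8 hours
```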

In [39]:
# Start with the first segment
segment_A = only_instagram.iloc[0:2000].copy(deep = True)
len(segment_A)

2000

In [31]:
# Create 3 empty lists
ins_posts = []
ins_followers = []
ins_following = []

In [32]:
# Run the function on every URL in Segment A
for url in segment_A.profile_url: 
    if instagram in url: 
        outcome = insta_function(url)
        ins_posts.append(outcome[0])
        ins_followers.append(outcome[1])
        ins_following.append(outcome[2])
    else:
        ins_posts.append('Error')
        ins_followers.append('Error')
        ins_following.append('Error')
print(ins_posts)
print(ins_followers)
print(ins_following)

['brandonhaddock1FollowMessage11', 'datboinoahFollowMessage90', 'averybrittingham11FollowMessage19', 'jpgrant.12VerifiedFollowMessage44', 'caden.hoffmanFollowMessage69', 'scottduncanwxVerifiedFollowMessage595', 'lat_mayen11VerifiedFollowMessage15', 'camt3Follow0', 'roriharmonVerifiedFollowMessage82', 'jfinneyyFollowMessage10', 'waheed5_FollowMessage8', 'jadalogan92Follow16', 'alfispilFollowMessage89', 'greg_jones3FollowMessage29', 'm_rob5FollowMessage45', 'gabzilla32Follow1', 'mritchie8Follow135', 'nurturednutrition_FollowMessage114', 'blayreshultzFollowMessage142', 'lmcook22Follow125', '_arnaldotoroFollowMessage38', 'miamihurricanescaneshoopscanesempirecanesuniverseeasyyyooosouthbeachsportss•Follow35', 'boffeli_graceFollowMessage33', 'jamie_r10Follow65', 'nydia_lampkin21FollowMessage25', 'chet_holmgrenVerifiedFollowMessage87', 'okaformichael115Follow12', 'adaywithrossFollow14', 'ly_rehnstromFollow9', 'shaqj_treceFollowMessage10', 'coltonreed98FollowMessage22', 'caroline_waiteFollowMes

In [35]:
# Make a copy of the first 2,000 we ran as lists 
posts_A = ins_posts.copy()
followers_A = ins_followers.copy()
following_A = ins_following.copy()

In [36]:
# Check the length of these lists
print(len(ins_posts))
print(len(ins_followers))
print(len(ins_following))

2000
2000
2000


In [42]:
# Continue with Segment B 
segment_B = only_instagram.iloc[2000:4000].copy(deep = True)
len(segment_B)

2000

In [43]:
# Run the function on every URL in Segment B
for url in segment_B.profile_url: 
    if instagram in url: 
        outcome = insta_function(url)
        ins_posts.append(outcome[0])
        ins_followers.append(outcome[1])
        ins_following.append(outcome[2])
    else:
        ins_posts.append('Error')
        ins_followers.append('Error')
        ins_following.append('Error')
print(ins_posts)
print(ins_followers)
print(ins_following)

In [45]:
# Check the length of these lists
print(len(ins_posts))
print(len(ins_followers))
print(len(ins_following))

4000
4000
4000


In [46]:
# Make a copy of the first 4,000 we ran as lists 
posts_B = ins_posts.copy()
followers_B = ins_followers.copy()
following_B = ins_following.copy()

In [47]:
# Continue with Segment C
segment_C = only_instagram.iloc[4000:].copy(deep = True)
len(segment_C)

2205

In [None]:
# Run the function on every URL in Segment C
for url in segment_C.profile_url: 
    if instagram in url: 
        outcome = insta_function(url)
        ins_posts.append(outcome[0])
        ins_followers.append(outcome[1])
        ins_following.append(outcome[2])
    else:
        ins_posts.append('Error')
        ins_followers.append('Error')
        ins_following.append('Error')
print(ins_posts)
print(ins_followers)
print(ins_following)

In [49]:
# Check the length of these lists
print(len(ins_posts))
print(len(ins_followers))
print(len(ins_following))

5669
5669
5669
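
The lists hold 5,669 entries rather than the 6,205 that a full Segment C run would produce, which suggests the loop stopped early. One way to make such a run resumable is to derive the restart position from the number of results already collected. This is a minimal sketch under that assumption; `scrape_remaining` and `fake_scrape` are hypothetical names, and `fake_scrape` stands in for `insta_function` so the example runs without a browser.

```python
def scrape_remaining(urls, results, scrape_fn):
    """Scrape only the URLs that do not have a result yet (hypothetical helper)."""
    start = len(results)  # number of URLs already processed
    for url in urls[start:]:
        results.append(scrape_fn(url))
    return results

# Stand-in for insta_function so the sketch runs without a browser.
def fake_scrape(url):
    return [url + '_posts', 'posts0', 'followers0']

urls = ['u0', 'u1', 'u2', 'u3', 'u4']
results = [fake_scrape(u) for u in urls[:3]]  # pretend the run stopped after 3 URLs
scrape_remaining(urls, results, fake_scrape)
print(len(results))  # 5
```

Because the restart position is computed from `len(results)`, calling the helper again after an interruption never scrapes a URL twice.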


In [50]:
print(ins_posts)

In [51]:
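# Continue with Segment D: the remaining URLs from position 5,669 onward,
# since only 5,669 results had been collected after the Segment C run.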
segment_D = only_instagram.iloc[5669:].copy(deep = True)
len(segment_D)

536

In [52]:
# Run the function on every URL in Segment D
for url in segment_D.profile_url: 
    if instagram in url: 
        outcome = insta_function(url)
        ins_posts.append(outcome[0])
        ins_followers.append(outcome[1])
        ins_following.append(outcome[2])
    else:
        ins_posts.append('Error')
        ins_followers.append('Error')
        ins_following.append('Error')
print(ins_posts)
print(ins_followers)
print(ins_following)

In [53]:
print(len(ins_posts))
print(len(ins_followers))
print(len(ins_following))
# Second part is now complete

6205
6205
6205


In [129]:
# Adding the lists into the dataframe as new columns
only_instagram['ins_posts'] = ins_posts
only_instagram['ins_followers'] = ins_followers
only_instagram['ins_following'] = ins_following
len(only_instagram)

6205

In [133]:
# Read the first 1,000 rows that we obtained previously 

first_thousand = pd.read_csv('first_1000.csv')
first_thousand.head()

# Select only twitter profile urls, excluding instagram profile urls
if_twitter = [] 
twitter = 'twitter'

for url in first_thousand.profile_url: 
    if twitter in url: 
        if_twitter.append(True)
    else: 
        if_twitter.append(False)
print(len(if_twitter))

first_thousand['if_twitter'] = if_twitter

first_thousand_instagram = first_thousand[first_thousand['if_twitter'] == False].copy()

first_thousand_instagram.head()

1000


Unnamed: 0.1,Unnamed: 0,athlete_id,name,team,verified,private,profile_url,twt_follower,twt_following,twt_tweets,twt_teneur,ins_posts,ins_followers,ins_following,if_twitter
2,2,2498368,Tim Harrison,Rice,n,,https://www.instagram.com/tim_harrison_music/?...,,,,,tim_harrison_musicFollow44,posts235,followers461,False
4,4,4397189,Jalen Smith,Maryland,y,,https://www.instagram.com/thejalen_smith/?hl=en,,,,,thejalen_smithVerifiedFollowMessage37,posts51.2K,followers483,False
5,5,4603382,Eric Bower,UCSB,y,,https://www.instagram.com/et.bower/?hl=en,,,,,et.bowerFollow10,"posts1,124","followers1,118",False
7,7,4595703,Alana Perkins,Bryant,y,1.0,https://www.instagram.com/alana.perkins/?hl=en,,,,,alana.perkinsFollow35,"posts1,902","followers1,402",False
10,10,4279222,Tomas Murphy,Northeastern,y,,https://www.instagram.com/tomasmurphy33/?hl=en,,,,,tomasmurphy33Follow60,"posts2,366","followers1,072",False


In [134]:
# Check the total length of the first 1,000 URLs we extracted info from.
print('There are', len(first_thousand_instagram), 'instagram profiles in the first 1,000 rows of the original dataset')

There are 516 instagram profiles in the first 1,000 rows of the original dataset


In [135]:
# Append the rest of the Instagram profiles to the ones from the first thousand.
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
all_instagram = pd.concat([first_thousand_instagram, only_instagram])
all_instagram.head()
len(all_instagram)

6721

In [65]:
# Check for null values
all_instagram.isnull().sum()

Unnamed: 0           6205
athlete_id              0
name                    0
team                    0
verified             6705
private              6720
profile_url             0
twt_follower         6718
twt_following        6718
twt_tweets           6718
twt_teneur           6718
ins_posts               0
ins_followers           0
ins_following           0
if_twitter           6205
instagram_profile     516
dtype: int64

In [70]:
# Check if the account is verified
ins_verified_list = []
verified = 'Verified'


for x in all_instagram.ins_posts: 
    if verified in str(x): 
        ins_verified_list.append(1)
    else:
        ins_verified_list.append(0)
print('Instagram Verified List:', ins_verified_list)
print(len(ins_verified_list))
all_instagram['ins_verified'] = ins_verified_list

Instagram Verified List: [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
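
The verified flag can also be computed without an explicit loop. The sketch below applies the same substring test with pandas string methods to three tokens copied from the scraped output above; it assumes, as the loop does, that the text "Verified" appears only as the badge label.

```python
import pandas as pd

# Sample tokens copied from the scraped output above.
posts = pd.Series([
    'brandonhaddock1FollowMessage11',
    'jpgrant.12VerifiedFollowMessage44',
    'camt3Follow0',
])

# 1 when the token contains the "Verified" badge text, else 0.
ins_verified = posts.astype(str).str.contains('Verified', regex=False).astype(int)
print(ins_verified.tolist())  # [0, 1, 0]
```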

In [72]:
# Save a copy.
all_instagram.to_csv('all_instagram.csv')

In [119]:
# Check both scenarios (verified or not verified) in order to get the post numbers.
my_string = 'malachidesousa_FollowMessage27'
print(my_string.split('ollow')[1].replace('Message', '').replace(',', ''))

# Verified account
my_string2 = 'adamflaglerVerifiedFollowMessage40'
print(my_string2.split('ollow')[1].replace('Message', '').replace(',', '')) 

# Both cases work.

27
40
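
Splitting on `'ollow'` relies on that text appearing exactly once before the count. As a hedged alternative, a regular expression can read the count from the end of the token instead. `trailing_count` is a hypothetical helper, and the third test string uses a made-up handle.

```python
import re

# Trailing count: digits with optional thousands separators, a decimal point, and a K/M suffix.
TRAILING_COUNT = re.compile(r'(\d[\d,.]*[KM]?)$')

def trailing_count(token):
    """Return the count at the end of a token without commas, or None if there is none."""
    match = TRAILING_COUNT.search(token)
    return match.group(1).replace(',', '') if match else None

print(trailing_count('malachidesousa_FollowMessage27'))      # 27
print(trailing_count('adamflaglerVerifiedFollowMessage40'))  # 40
print(trailing_count('follow_me_Follow1,204'))               # 1204
print(trailing_count('Sorry,'))                              # None
```

For the made-up handle `follow_me_`, `split('ollow')[1]` would return `'_me_F'`, whereas anchoring the pattern at the end of the string still recovers the count.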


In [126]:
# Read the file copy that we just saved.
df = pd.read_csv('all_instagram.csv')
df.isnull().sum()

Unnamed: 0              0
Unnamed: 0.1         6205
athlete_id              0
name                    0
team                    0
verified             6705
private              6720
profile_url             0
twt_follower         6718
twt_following        6718
twt_tweets           6718
twt_teneur           6718
ins_posts               0
ins_followers           0
ins_following           0
if_twitter           6205
instagram_profile     516
ins_verified            0
dtype: int64

In [127]:
# Check the head of the dataframe 
df.head()
len(df)

6721

In [108]:
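# Strip the surrounding text from each column so that only the counts remain.
# Each label trails its number in the page text, so the ins_followers values
# carry a 'posts' prefix and the ins_following values a 'followers' prefix.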
ins_post_list = []
ins_follower_list = []
ins_following_list = []


for a in df.ins_posts: 
    post_number = a.split('ollow')[-1].replace('Message', '').replace(',', '').replace('Sorry', 'NaN')
    ins_post_list.append(post_number)
print('Dataframe Instagram Post:', ins_post_list)

for b in df.ins_followers: 
    follower_number = b.replace('posts', '').replace(',', '').replace('this', 'NaN')
    ins_follower_list.append(follower_number)
print('Dataframe Instagram Follower:', ins_follower_list)

for c in df.ins_following: 
    following_number = c.replace('followers', '').replace(',', '').replace('page', 'NaN')
    ins_following_list.append(following_number)
print('Dataframe Instagram Following:', ins_following_list)

Dataframe Instagram Post: ['44', '37', '10', '35', '60', '185', '739', '278', '27', '939', '41', '129', '30', '216', '9', '12', '24', '17', '346', '83', '154', '69', '66', '86', '66', '167', '40', '274', '3', '116', '32', '95', '8', '22', '11', '275', '11', '11', '82', '949', '1', '4', '23', '17', '24', '19', '34', '61', '70', '141', '40', '52', '74', '17', '28', '72', '94', '46', '31', '186', '15', '640', '28', '14', '13', '74', '6', '5', '5635', '138', '98', '111', '53', '58', '19', '42', '39', '82', '140', '18', '20', '12', '16', '190', '18', '17', '117', '4', '6', '10', '10', '743', '24', '581', '13', '157', '29', '30', '462', '337', '100', '385', '33', '287', '24', '17', '12', '3118', '16', '7', '6', '114', '494', '61', '5', '90', '78', '52', '31', '3', '179', '13', '353', 'Phone', '145', '43', '6', '105', '26', '18', '8', '340', '93', '11', '205', '106', '452', '39', '6', '151', '89', '170', 'NaN', '24', '231', '65', '18', '37', '725', '29', '54', '19', '6', '32', '34', '28', '0'
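
The printed list still contains non-numeric leftovers such as `'Phone'` and `'NaN'`. One way to handle them, sketched here on a few values taken from that list, is `pd.to_numeric` with `errors='coerce'`, which turns anything unparseable into a missing value.

```python
import pandas as pd

# Sample cleaned values from the printed list above, including 'Phone' and 'NaN'.
cleaned = pd.Series(['44', '5635', 'Phone', 'NaN', '0'])

# errors='coerce' turns anything that is not a number into a missing value.
numeric = pd.to_numeric(cleaned, errors='coerce')
print(numeric.tolist())           # [44.0, 5635.0, nan, nan, 0.0]
print(int(numeric.isna().sum()))  # 2
```

Values that still carry a K or M suffix would also be coerced to missing, so this conversion belongs after the suffixes have been expanded.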

In [111]:
# Make the final lists.
# Four scenarios are considered: counts with a decimal point and a K or M suffix
# (e.g. '51.2K'), and whole numbers with a K or M suffix (e.g. '12K').

def expand_suffix(value):
    """Expand a K/M-abbreviated count string such as '51.2K' into plain digits."""
    value = str(value)
    if '.' in value and 'K' in value:
        return value.replace('.', '').replace('K', '00')
    elif '.' in value and 'M' in value:
        return value.replace('.', '').replace('M', '00000')
    elif 'K' in value:
        return value.replace('K', '000')
    elif 'M' in value:
        return value.replace('M', '000000')
    return value

final_post = [expand_suffix(a) for a in ins_post_list]
final_follower = [expand_suffix(b) for b in ins_follower_list]
final_following = [expand_suffix(c) for c in ins_following_list]
    
print(len(final_post))
print(len(final_follower))
print(len(final_following))

6721
6721
6721
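
The string-replacement approach assumes at most one digit after the decimal point. A numeric conversion avoids that assumption; the sketch below uses `Decimal` so that values like `51.2K` convert exactly. `expand_count` is a hypothetical helper, not the notebook's method.

```python
from decimal import Decimal

MULTIPLIERS = {'K': 1_000, 'M': 1_000_000}

def expand_count(text):
    """Convert '51.2K', '3M', or '1124' to an int; return None for non-numeric text."""
    text = str(text).replace(',', '')
    factor = MULTIPLIERS.get(text[-1:], 1)
    digits = text[:-1] if factor != 1 else text
    try:
        return int(Decimal(digits) * factor)
    except (ArithmeticError, ValueError):
        # Covers unparseable text such as 'Phone' and the placeholder 'NaN'.
        return None

print(expand_count('51.2K'))  # 51200
print(expand_count('1.25M'))  # 1250000
print(expand_count('Phone'))  # None
```

Returning integers (or `None`) rather than digit strings also means the columns can be used in calculations without a further conversion step.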


In [112]:
# Add the modified lists back to dataframe 
df['ins_posts'] = final_post
df['ins_followers'] = final_follower
df['ins_following'] = final_following

In [113]:
# Check head of the dataframe
df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,athlete_id,name,team,verified,private,profile_url,twt_follower,twt_following,twt_tweets,twt_teneur,ins_posts,ins_followers,ins_following,if_twitter,instagram_profile,ins_verified
0,2,2.0,2498368,Tim Harrison,Rice,n,,https://www.instagram.com/tim_harrison_music/?...,,,,,44,235,461,False,,0
1,4,4.0,4397189,Jalen Smith,Maryland,y,,https://www.instagram.com/thejalen_smith/?hl=en,,,,,37,51200,483,False,,1
2,5,5.0,4603382,Eric Bower,UCSB,y,,https://www.instagram.com/et.bower/?hl=en,,,,,10,1124,1118,False,,0
3,7,7.0,4595703,Alana Perkins,Bryant,y,1.0,https://www.instagram.com/alana.perkins/?hl=en,,,,,35,1902,1402,False,,0
4,10,10.0,4279222,Tomas Murphy,Northeastern,y,,https://www.instagram.com/tomasmurphy33/?hl=en,,,,,60,2366,1072,False,,0


In [115]:
# Make the final dataframe for manual modification
instagram_finished = df[['athlete_id', 'name', 'team', 'verified', 'profile_url', 'twt_follower', 
                         'twt_following', 'twt_tweets', 'twt_teneur', 'ins_posts', 'ins_followers',
                         'ins_following', 'ins_verified']]
instagram_finished.head()

Unnamed: 0,athlete_id,name,team,verified,profile_url,twt_follower,twt_following,twt_tweets,twt_teneur,ins_posts,ins_followers,ins_following,ins_verified
0,2498368,Tim Harrison,Rice,n,https://www.instagram.com/tim_harrison_music/?...,,,,,44,235,461,0
1,4397189,Jalen Smith,Maryland,y,https://www.instagram.com/thejalen_smith/?hl=en,,,,,37,51200,483,1
2,4603382,Eric Bower,UCSB,y,https://www.instagram.com/et.bower/?hl=en,,,,,10,1124,1118,0
3,4595703,Alana Perkins,Bryant,y,https://www.instagram.com/alana.perkins/?hl=en,,,,,35,1902,1402,0
4,4279222,Tomas Murphy,Northeastern,y,https://www.instagram.com/tomasmurphy33/?hl=en,,,,,60,2366,1072,0


In [116]:
# Save a copy of the result 
instagram_finished.to_csv('instagram_finished.csv')
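
Each save and reload in this notebook adds another `Unnamed: 0` column, because `to_csv` writes the row index by default. The sketch below shows the effect on a made-up two-row frame, using in-memory buffers instead of files; passing `index=False` is one way to avoid the extra column.

```python
import io

import pandas as pd

frame = pd.DataFrame({'athlete_id': ['a1', 'a2'], 'ins_posts': [44, 37]})  # made-up rows

# Default: the row index is written as an unnamed first column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)     # ['Unnamed: 0', 'athlete_id', 'ins_posts']

# index=False leaves the row index out of the file.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)  # ['athlete_id', 'ins_posts']
```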

In [123]:
# Check the length of the final dataframe.
len(instagram_finished)

6721

## IV. Instagram Profile Data Extraction is Completed.