## Identifying possibly changed usernames within Instagram posts

In [1]:
import pandas as pd
import re
import difflib

The `comment_to_dict` function takes a string containing an HTML-formatted Instagram comment and extracts the username, message, and timestamp. It returns a dictionary with these three values. The function uses regular expressions to find the relevant patterns in the input string.

The `remove_font` function takes an HTML-formatted string containing a username wrapped in a `<font>` tag and returns the bare username string.

In [2]:
def comment_to_dict(string):

    # extract username
    username = re.search(r'<font.*?>(.*?)</font>', string).group(1)

    # extract message
    message = re.search(r'</font>(.*?)\(created_at:', string).group(1).strip()

    # extract timestamp
    timestamp = re.search(r'created_at:(.*)\)', string).group(1).strip()

    # create a dictionary with the extracted values
    result = {'username': username, 'message': message, 'timestamp': timestamp}
    return result

def remove_font(string):
    username = re.search('<font.*?>(.*?)</font>', string).group(1)
    return username
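
As a quick check of the parsing logic, the sketch below runs the same regular expressions on a made-up comment string. The sample markup (the `color` attribute, the username `alice_1`, the timestamp format) and the helper name `parse_comment` are assumptions for illustration; the real cells in the CSV files may differ in detail.

```python
import re

def parse_comment(raw):
    """Parse one comment cell with the same patterns as comment_to_dict."""
    return {
        'username': re.search(r'<font.*?>(.*?)</font>', raw).group(1),
        'message': re.search(r'</font>(.*?)\(created_at:', raw).group(1).strip(),
        'timestamp': re.search(r'created_at:(.*)\)', raw).group(1).strip(),
    }

# Hypothetical comment cell; not taken from the dataset.
sample = '<font color="#0066CC">alice_1</font> nice pic @bob (created_at:2014-01-01 12:00:00)'
parsed = parse_comment(sample)
print(parsed)

# A string without the expected markup makes re.search return None,
# so .group(1) raises AttributeError.
try:
    parse_comment('no markup here')
    parse_failed = False
except AttributeError:
    parse_failed = True
print('parse failed:', parse_failed)
```

Because a missing pattern surfaces as an `AttributeError`, callers that want to skip malformed cells can catch that specific exception.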

Reading and converting the data files to a DataFrame

In [3]:

# read in the first CSV file
df1 = pd.read_csv('/Users/christiedjidjev/Library/CloudStorage/OneDrive-Personal/BullyBlocker/Matching Instagram Sessions and Pictures 2020/labeled_0plus_to_10__full.csv', encoding='iso-8859-1')
print(type(df1))

# read in the second CSV file
df2 = pd.read_csv('/Users/christiedjidjev/Library/CloudStorage/OneDrive-Personal/BullyBlocker/Matching Instagram Sessions and Pictures 2020/labeled_10plus_to_40_full.csv', encoding='iso-8859-1')

# read in the third CSV file
df3 = pd.read_csv('/Users/christiedjidjev/Library/CloudStorage/OneDrive-Personal/BullyBlocker/Matching Instagram Sessions and Pictures 2020/labeled_40plus_full.csv', encoding='iso-8859-1')

# merge the three dataframes into a single dataframe
df = pd.concat([df1, df2, df3])

# write the merged dataframe to a new CSV file
#df.to_csv('merged_insta_data.csv', index=False)


<class 'pandas.core.frame.DataFrame'>


This code extracts information from `df` to create a list of social media posts and their associated comments, which is stored in the `sessions` list. 

The `param_list` variable lists the post-level columns copied from the DataFrame for each post. `cmnt_param_list` is defined alongside it but is not used by the loop; the comment fields come from `comment_to_dict` instead.

The code loops over each row in the DataFrame and, for each row, creates a new post dictionary and an empty comment list. It then scans column positions 16 through 210 (`range(16, 211)`) for comment strings. Each string is converted to a comment dictionary with `comment_to_dict` and appended to the comment list for the current post; the scan stops at the first cell holding the placeholder value `'empety'`. Strings that fail to parse are counted in `error_count`.

After all comments for the current post have been processed, the code extracts the post parameters specified in `param_list` from the DataFrame, adds the comment list to the post dictionary, and appends the resulting dictionary to the `sessions` list. The code continues the process for the remaining rows in the DataFrame.


In [4]:
sessions = []

param_list =['owner_id', 'cptn_time', 'owner_cmnt'] 
cmnt_param_list =['username', 'text', 'timeposted']    

error_count = 0
for row in range(len(df)):
#for row in range(5):
    post_dict = {}
    comment_list = [] 
    for col in range(16, 211):
        cmnt_string = df.iloc[row, col]
        # 'empety' is the placeholder the loop treats as "no more comments"
        if cmnt_string != 'empety':
            cmnt_string = cmnt_string.replace('created at', 'created_at')
            try:
                comment = comment_to_dict(cmnt_string)
                comment_list.append(comment)
            except AttributeError:
                # re.search found no match, so .group(1) failed
                error_count += 1
                #print('cmnt_string =', cmnt_string)
        else:
            break
    for col in param_list:
        post_dict[col] = df.iloc[row][col]
    
    post_dict['comments'] = comment_list
    sessions.append(post_dict)

print(error_count)

0
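
The early exit on the placeholder value is worth seeing in isolation. The sketch below uses a plain list of rows rather than a DataFrame, and the column layout and cell values are hypothetical. Note that any cell after the first placeholder is never read.

```python
PLACEHOLDER = 'empety'  # placeholder spelling checked by the loop above

# Hypothetical rows: index 0 is the owner, the rest are comment cells.
rows = [
    ['owner_a', 'c1', 'c2', PLACEHOLDER, 'c4'],
    ['owner_b', PLACEHOLDER, PLACEHOLDER, PLACEHOLDER, PLACEHOLDER],
]

toy_sessions = []
for row in rows:
    comments = []
    for cell in row[1:]:
        if cell == PLACEHOLDER:
            break  # stop at the first placeholder; later cells are skipped
        comments.append(cell)
    toy_sessions.append({'owner_id': row[0], 'comments': comments})

print(toy_sessions)
```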


This code loops through `sessions` (a list of post dictionaries).

For each post, it collects the usernames seen so far: the post owner and the author of each comment, in order. It then checks whether every `@username` mentioned in a comment exactly matches one of those names.

If not, it finds the closest previously seen username using a character-level diff from `difflib.ndiff` (the number of inserted and deleted characters) and converts that distance to a similarity score, `1 - distance / max(len(a), len(b))`. If the similarity is greater than `threshold` (0.8), the mention is counted as a changed username and the (mentioned, closest) pair is recorded in `changed_users`; pairs with similarity at or below `higher_threshold` (0.9) are also added to `diff_list`. Otherwise the mention is counted as missing and the post is marked as dirty. The counts of correct, changed, and missing mentions are printed in the following cell.
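
To make the similarity score concrete, here is a minimal sketch of the same calculation on made-up usernames. The helper names `ndiff_distance` and `similarity` are introduced here for illustration only. Because `difflib.ndiff` reports a substituted character as one deletion plus one insertion, this distance is not a standard Levenshtein distance, and the score can drop below zero for very different names.

```python
import difflib

def ndiff_distance(a, b):
    """Count diff entries that are not marked as unchanged."""
    return sum(1 for d in difflib.ndiff(a, b) if d[0] != ' ')

def similarity(a, b):
    """Map the distance to a score; 1.0 means identical strings."""
    return 1.0 - ndiff_distance(a, b) / max(len(a), len(b))

# Hypothetical usernames, not taken from the dataset.
pairs = [('alice_1', 'alice1'), ('alice1', 'alice2'), ('alice_1', 'bob')]
for seen, mentioned in pairs:
    print(seen, mentioned,
          ndiff_distance(seen, mentioned),
          round(similarity(seen, mentioned), 3))
```

With a threshold of 0.8, the first pair (one deleted character out of seven) would count as a changed username, while the other two would not.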

In [6]:
correct_count, changed_count, missing_count = 0, 0, 0
changed_comments = []
diff_list = []
changed_posts_count = 0
dirty_posts_count = 0
threshold = 0.8
higher_threshold =0.9
changed_users = set()
clean_posts = []

for post in sessions:
	changed_count_per_post = 0
	total_users_per_post= []
	changed_post = False
	dirty_post = False

	Session_users=[]
	post['owner_id'] = post['owner_id'].replace('\n', '')
	try:
		Session_users.append(remove_font(post['owner_id']))
	except:
		pass
	for comment in post['comments']:
		Session_users.append(comment['username'])
		usernames = re.findall(r'@(\w+)', comment['message'])
		for uname in usernames: # @name mentioned in the comment 
			total_users_per_post.append(uname)
			if uname in Session_users:
				correct_count += 1
			else:
				best_match = None
				min_dist = 100
				for u in Session_users: # comment/post author names used in the post so far
					e_dist = difflib.ndiff(u, uname)
					e_dist = sum(1 for d in e_dist if d[0] != ' ')
					if min_dist > e_dist:
						min_dist = e_dist
						best_match = u
											
				sim = 1.0 - min_dist/max(len(best_match), len(uname))
				p = (uname, best_match)
				if sim > threshold: 
					changed_comments.append(comment)
					changed_users.add(p)
					changed_count += 1
					changed_count_per_post += 1
					changed_post = True 
					if sim <= higher_threshold:
						diff_list.append(p)
				else: 
					clean_comment = re.sub(r'<[^>]*>', '', comment['message'])
					missing_count += 1
					dirty_post = True 
			Session_users.append(uname)	
	
	if changed_post: changed_posts_count += 1
	if dirty_post: dirty_posts_count += 1
	else: clean_posts.append(post)
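
One detail of the mention pattern `@(\w+)` is easy to miss: `\w` matches letters, digits, and underscores but not periods, which Instagram allows in usernames. The sketch below, on a made-up message, shows a dotted name being cut at the first period. The alternative pattern is only a suggestion, not something the analysis above uses.

```python
import re

message = 'thanks @john.doe and @amy_1!'  # hypothetical comment text

# Pattern used in the loop above: stops at the first period.
short = re.findall(r'@(\w+)', message)
print(short)

# A possible alternative that also accepts periods inside a name.
dotted = re.findall(r'@(\w+(?:\.\w+)*)', message)
print(dotted)
```

A truncated mention such as `john` would then be compared against `john.doe` by the similarity step, which may inflate the changed or missing counts.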

In [7]:
print("Total # of posts:", len(sessions))
print("Results for threshold of", threshold, "and higher threshold of", higher_threshold)
print("Counts of username by type:")
print("\tnoDirectMatch-NoSimMatch:", missing_count, "\n\tdirectMatch:", correct_count, "\n\tnoDirectMatch-SimMatch:", changed_count)
print('# of posts containing changed usernames:', changed_posts_count)
print('# of ("dirty") posts containing noDirectMatch-NoSimMatch usernames:', dirty_posts_count)
print('# of ("clean") posts not containing any noDirectMatch-NoSimMatch usernames:', len(clean_posts))

Total # of posts: 2246
Results for threshold of 0.8 and higher threshold of 0.9
Counts of username by type:
	noDirectMatch-NoSimMatch: 31053 
	directMatch: 37517 
	noDirectMatch-SimMatch: 1326
# of posts containing changed usernames: 999
# of ("dirty") posts containing noDirectMatch-NoSimMatch usernames: 2110
# of ("clean") posts not containing any noDirectMatch-NoSimMatch usernames: 136
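
The reported counts can be cross-checked with a little arithmetic. The sketch below only reuses the numbers printed above; the variable names are chosen here for illustration.

```python
# Counts copied from the printed output above.
total_posts = 2246
direct_match = 37517   # mention matched a previously seen username exactly
sim_match = 1326       # no exact match, similarity above the threshold
no_match = 31053       # no exact match, similarity at or below the threshold
dirty_posts, clean_posts = 2110, 136

total_mentions = direct_match + sim_match + no_match
for label, n in [('direct', direct_match), ('similar', sim_match), ('none', no_match)]:
    print(f'{label}: {n / total_mentions:.1%} of {total_mentions} mentions')

# Every post is either dirty or clean, so the two counts should add up.
posts_add_up = dirty_posts + clean_posts == total_posts
print(posts_add_up)
```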
