### Twitter texts  emoji collecting

Finding emoji

Initial pass.  Check data feasibility.

For more unicode code points look [here.](https://unicode.org/emoji/charts/full-emoji-list.html)

In [27]:
from re import findall
import unicodedata
from collections import defaultdict
import os

#üòÇ code: 128514 face with tears of joy
#üëç code: 128077 thumbs up
#üî• code: 128293 fire
codes = [128514,128077,128293]
######  OR ##################
#inc=60
#inc=1036
#codes = range(0x1F600, 0x1F600+inc)


chat_corpus_dir = 'chat_corp/chat_corpus-master/'
emoji_cts_initial = defaultdict(list)

twitter_data_location = os.path.join(chat_corpus_dir, 'twitter_en.txt')

with open (twitter_data_location) as fh:
    # A list of lines
    text_str = fh.read()
 
# There are long uninterrupted seqs of printable emoji starting at 0x1F600
# and 0001f900
for code in codes:
    found = findall(chr(code),text_str)
    try:
        print(f"{code:08x} {chr(code)} {unicodedata.name(chr(code))} {len(found)} found!")
    except ValueError:
        print(f"****Couldn't print {code}!*****")
    emoji_cts_initial[code] = len(found)

0001f602 üòÇ FACE WITH TEARS OF JOY 24225 found!
0001f44d üëç THUMBS UP SIGN 2459 found!
0001f525 üî• FIRE 2331 found!


#### The main task

1. For each of the three chosen emoji types:

   a. Find the tweets where the emoji type occurs at the end of tweet.  Save to a file
   b. Find the tweets wherethe emoji type occurs at the start of tweet.  Save to a file
   c. Find the tweets where the emoji type occurs anywhere in the tweet.  Save to a file

2.  For all the hits in 1, preserve the line number and get the other half of the conversational pair
    that line belongs to.  For tweets with index i, i is odd, get tweet i - 1 (00-based indexing).
    For tweets with index i, i is even, get tweet i + 1.

The main product of the following code if three dictionaries containing the desired tweet data: `emoji_results`, `emoji_results_start`, and   `emoji_results_end`.

The keys for all the dictionaries are the emoji codes.  See bellow how the values are organized.

In [4]:
import re
from re import findall
import unicodedata
from collections import defaultdict, Counter


with open (twitter_data_location) as fh:
    # A list of lines
    text = fh.readlines()
    
emoji_results = defaultdict(list)
emoji_results_start = defaultdict(list)
emoji_results_end = defaultdict(list)
emoji_counts = Counter()
double_hits =  defaultdict(set)
# The three choisen emoji
codes = [128514,128077,128293]

for (i,line) in enumerate(text):
    line=line.strip()
    for code in codes:
        if i in double_hits[code]:
            continue
        emoji = chr(code)
        if re.search(emoji,line):
            if i % 2 == 1:
                other_line = text[i-1].strip()
                pair = (i-1, other_line, line)
            else:
                other_line = text[i+1].strip()
                pair = (i, line, other_line)
                if re.search(emoji,other_line):
                    double_hits[code].add(i+1)
                    emoji_counts[code] += other_line.count(emoji)
            emoji_results[code].append(pair)
            if re.match(chr(code),line):
                emoji_results_start[code].append(pair)
            if line.endswith(chr(code)):
                emoji_results_end[code].append(pair)
            emoji_counts[code] += line.count(emoji)
            

Checking that (even,odd) constitutes a conversational pair, because of 0-based indexing.

In [88]:
start=38
for (i,l) in enumerate(text[start:50],start):
    print(i, l)
    if i%2 == 1:
        print("="*30, end="\n\n")

38 you think so ? a lot girls been fucking with it just not the right ones lmfao, but i'm thinking its time for the new cut üòÇ

39 idk how females fuck with this üòÇ


40 dad!! ‚Äôs dad is our new favorite person.

41 u gotta love his dadüòç


42 i bet these guys that are playing d for philly feel so fresh!

43 after those takeaways! bet they're competing for who gets the next one!!


44 4-0. i don't remember you ever winning. &amp; i took this one. you didn't "let" nothing. üòÇ

45 üò≠üò≠üò≠ i thought you was gonna let me slide


46 oh no she's back üòí

47 you talking about the brand ambassador of watsapp?


48 after a long hiatus, i've joined a gym. thus ends my rather wonderful minimum viable body phase. üôèüèΩ

49 bay club or equinox?




Checking the first 3 hit pairs for FACE WITH TEARS OF JOY in the dictionary `emoji_results`.  One member of the conversational pair should have FACE WITH TEARS OF JOY somewhere in it.

In [5]:
code = 128514
face = emoji_results[code]

print(emoji_counts[code]) # Correct number of hits based on initial pass counts
print(len(face)) # These two numbers are different because a single conversational pair
                 # may contain mutltiple hits on the same emoji.  See 44 below
for tw in face[:3]:
    print(tw)

24225
12166
(38, "you think so ? a lot girls been fucking with it just not the right ones lmfao, but i'm thinking its time for the new cut üòÇ", 'idk how females fuck with this üòÇ')
(44, '4-0. i don\'t remember you ever winning. &amp; i took this one. you didn\'t "let" nothing. üòÇ', 'üò≠üò≠üò≠ i thought you was gonna let me slide')
(52, "you think so ? a lot girls been fucking with it just not the right ones lmfao, but i'm thinking its time for the new cut üòÇ", 'gotta get that brad pitt from fury')


Checking dictionary `emoji_results_start`: One member of the conversational pair should have FACE WITH TEARS OF JOY at the start.

In [6]:
code = 128514
face_start = emoji_results_start[code]

print(len(face_start)) # These two numbers are different because a single conversational pair
                 # may contain mutltiple hits on the same emoji.  See 44 below
for tw in face_start[:3]:
    print(tw)

1068
(1916, 'give him some water üò≠', "üòÇüòÇüòÇüòÇ we should've gave him some. he would've been out of the race")
(2174, 'üòÇüòÇüòÇüòÇ i9milha boogie', 'iphone 7 w nba 2k17 oula full o*****m üòÇ')
(2348, 'dad: u want me to pay for this?! me: ya...sorry dad: well thats okay guys i got a wallet that says shit ton of money', 'üòÇi love it!!!')


Checking dictionary `emoji_results_end`: One member of the conversational pair should have FACE WITH TEARS OF JOY at the end.

In [66]:
code = 128514
face_end = emoji_results_end[code]

print(len(face_end)) # These two numbers are different because a single conversational pair
                 # may contain mutltiple hits on the same emoji.  See 44 below
for tw in face_end[-3:]:
    print(tw)

8048
(754300, "i know someone who'd cry..", 'lol not gonna lie i probably would cry üòÇ')
(754490, "omg beau brought his friend who is from san diego, but is actually from london but he's mexican and italian i'm so confused üòÇüòÇ", "and it doesn't help that he's drunk üíÄüíÄ")
(754496, 'thou have broughten the fermented oat elixirüòÇüòÇ', 'i poureth some for the comrades')


#### Saving to files

The basic idea.  Create DataFrames. Use `df.to_csv()`.

We create 9 DataFrames and save to 9 files.  There are 3 emojis 

```
codes = [128514,128077,128293]
```

and three emoji locations (anywhere, start, end), yielding 9 files. For example, the 
`anywhere_face_with_tears_of_joy.csv` contains Tweets with the emoji TEARS OF JOY
anywhere in the Tweet.

From the `emoji_results` dictionary to one of the 9 DataFrames:

In [7]:
import pandas as pd
(index, initial, response) = zip(*emoji_results[code])
df = pd.DataFrame(dict(Utterance=initial,Response=response),index=index, columns=["Utterance","Response"])

In [8]:
df.head()

Unnamed: 0,Utterance,Response
38,you think so ? a lot girls been fucking with i...,idk how females fuck with this üòÇ
44,4-0. i don't remember you ever winning. &amp; ...,üò≠üò≠üò≠ i thought you was gonna let me slide
52,you think so ? a lot girls been fucking with i...,gotta get that brad pitt from fury
184,guess youre gon have to bite me hahaha,&lt;-- she volunteered so... yeah. haha. go fo...
216,no you deadass said i'm getting a crouton for ...,this is funny af üòÇ


Simultaneously create all 9 DataFrames and save them to 9 files.

In [9]:
fn_stems = ["anywhere", "start", "end"]
emoji_dicts = dict(anywhere=emoji_results, start=emoji_results_start, end=emoji_results_end)
file_dict = dict()

def get_filename (code,emoji_loc):
    emoji_name_str = '_'.join(unicodedata.name(chr(code)).lower().split())
    return '_'.join([emoji_loc,emoji_name_str]) + '.csv'

for emoji_loc in fn_stems:
    emoji_dict = emoji_dicts[emoji_loc]
    for code in emoji_dict:
        (index, initial, response) = zip(*emoji_dict[code])
        df = pd.DataFrame(dict(Utterance=initial,Response=response),
                          index=index, 
                          columns=["Utterance","Response"])
        fn = get_filename (code,emoji_loc)
        file_dict[fn] = df
        df.to_csv(fn,header=True,index=True)

Looking at the data.

In [10]:
fns = list(file_dict.keys())
fn = fns[0]
print(fn)
pd.set_option('display.column_space', 1000)
df = file_dict[fn]
df

anywhere_face_with_tears_of_joy.csv


  pd.set_option('display.column_space', 1000)


Unnamed: 0,Utterance,Response
38,you think so ? a lot girls been fucking with i...,idk how females fuck with this üòÇ
44,4-0. i don't remember you ever winning. &amp; ...,üò≠üò≠üò≠ i thought you was gonna let me slide
52,you think so ? a lot girls been fucking with i...,gotta get that brad pitt from fury
184,guess youre gon have to bite me hahaha,&lt;-- she volunteered so... yeah. haha. go fo...
216,no you deadass said i'm getting a crouton for ...,this is funny af üòÇ
...,...,...
754170,üòÇüòÇ ima hit the r2 button and hit her with that...,lmaoooo nooo not the coon cock lmaoooo
754246,he felt so bad üòÇüíÄ,lonzo ball too good!
754300,i know someone who'd cry..,lol not gonna lie i probably would cry üòÇ
754490,omg beau brought his friend who is from san di...,and it doesn't help that he's drunk üíÄüíÄ


To look at complete tweets (change column space so the tweets don't get truncated(:

In [113]:
# First 20 only utterance column.
df_str = df.iloc[:20][["Utterance"]].to_string(col_space=1000)
print(df_str)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        

Look at at one row one column.

In [110]:
df.loc[754170]["Utterance"]

'üòÇüòÇ ima hit the r2 button and hit her with that coom cock üòÇ'

Same tweet response column.

In [114]:
df.loc[754170]["Response"]

'lmaoooo nooo not the coon cock lmaoooo'

#### Retrieving from file

Every row in a tweet file has conversational pairs (2nd tweet is response to first).

In [28]:
import pandas as pd 


# For the file with face with tears of joy emojis
# all at the start of one the tweets in the conversational pair.
code,emoji_loc= 128514,"end"
chat_corpus_data_dir = 'chat_corp/chat_corpus_data'

fn =  os.path.join(chat_corpus_data_dir, get_filename (code,emoji_loc))
fn

'chat_corp/chat_corpus_data/end_face_with_tears_of_joy.csv'

In [29]:
df = pd.read_csv(fn,index_col=0,header=0)
len(df)

8048

In [18]:
df

Unnamed: 0,Utterance,Response
38,you think so ? a lot girls been fucking with i...,idk how females fuck with this üòÇ
44,4-0. i don't remember you ever winning. &amp; ...,üò≠üò≠üò≠ i thought you was gonna let me slide
52,you think so ? a lot girls been fucking with i...,gotta get that brad pitt from fury
184,guess youre gon have to bite me hahaha,&lt;-- she volunteered so... yeah. haha. go fo...
216,no you deadass said i'm getting a crouton for ...,this is funny af üòÇ
...,...,...
754170,üòÇüòÇ ima hit the r2 button and hit her with that...,lmaoooo nooo not the coon cock lmaoooo
754246,he felt so bad üòÇüíÄ,lonzo ball too good!
754300,i know someone who'd cry..,lol not gonna lie i probably would cry üòÇ
754490,omg beau brought his friend who is from san di...,and it doesn't help that he's drunk üíÄüíÄ


In [31]:
code,loc = 128293,"anywhere"
fn =  get_filename (code,loc)
fn

'anywhere_fire.csv'

In [33]:
fn = os.path.join(chat_corpus_data_dir, get_filename (code,emoji_loc))
df = pd.read_csv(fn,index_col=0,header=0)
print(len(df))
df

617


Unnamed: 0,Utterance,Response
486,no panty up at tonight üáµüá∑üëäüèæ,that song is üî•üî•üî•üî•
3230,these joints are so üî•,"so flee, remind me of these laser af 1 but way..."
3452,happy birthday gabby ‚ù£üî•üî•üî•,thank you !!!!üíôüíô
4268,yo it's true,your article voiced all the feelings i had whe...
4814,burn it üî•üî•üî•üî•üî•,nah bruh he's my exterminator. üï∑üï∑üï∑üï∑
...,...,...
751664,now you're fired jesse .... üî• but the bad kind...,you're just jealous that jesse and i have an u...
752806,i knowüòä i'll still be waiting for the invitati...,don't worry bruh i gotchuüî•
753554,mi pueblo had strawberries for 1$ each pack üî•üî•üî•,thanks for getting me some
753860,a friend that's down 99% of the time. üî•üî•üî•,that line defines me


In [34]:
len(df)

617