# Building Up a List of Suitable Tweet Search Terms

In [1]:
%load_ext lab_black
%load_ext autoreload
%autoreload 2

In [2]:
from glob import glob

import pandas as pd

## About

This notebook performs the following steps:
1. loads the sampled data that was
   - created in `/3-combine-data/notebooks/3-combine-data.ipynb`
   - manually labeled with the subject of the tweet
2. filters the data to capture the (manually labeled) wanted subjects
3. builds up a list of search terms that captures as many tweets as possible in the filtered data (i.e. as many tweets belonging to a wanted subject as possible)

## User Inputs

In [3]:
processed_data_dir = "../data/processed"

cols_to_use = ["text", "subject"]

wanted_list = [
    "webb",
    "jwst",
    "webb space",
    "webb telescope",
    "mirror deploy",
    "space telescope",
    "telescope launch",
    "new telescope",
    "live coverage",
    "live stream",
    "livestream",
    "congratulations",
    "congrats",
    "unfolding",
    "tightening",
    "tensioning",
    "sunshield",
    "sunshade",
    "l2",
    "heat shield",
    "primary mirror",
    "secondary mirror",
]
wanted_subject_strs = ["Jwst-mission", "Jwst-facts"]

In [4]:
wanted_str = "|".join(wanted_list)
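Since `str.contains` treats its pattern as a regular expression by default, joining the terms with `|` gives an alternation that matches when any one term occurs as a substring. The following is a minimal sketch of that behavior on made-up strings. The names `example_terms`, `example_pattern` and `example_texts` are hypothetical, and the `re.escape` step is an added precaution that is not part of the cell above (it would only matter if a term contained regex metacharacters).

```python
import re

import pandas as pd

# Hypothetical, shortened term list for illustration only
example_terms = ["webb", "jwst", "l2", "sunshield"]

# Escape each term so that any regex metacharacters are matched literally
example_pattern = "|".join(re.escape(term) for term in example_terms)

# Made-up texts, not taken from the labeled data
example_texts = pd.Series(
    [
        "Webb reaches L2 today",
        "The Sunshield is fully deployed",
        "Hubble is still going strong",
    ]
)

# Lowercase the text before matching, since all terms are lowercase
is_match = example_texts.str.lower().str.contains(example_pattern)
print(is_match.tolist())
```

Note that short terms such as `l2` match anywhere inside a word, so they can capture more text than intended.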

## Get Data

Load annotated data

In [5]:
%%time
df = pd.concat(
    [
        pd.read_excel(f, usecols=cols_to_use).dropna(subset=["subject"])
        for f in glob(f"{processed_data_dir}/*.xlsx")
    ],
    axis=0,
    ignore_index=True,
)
with pd.option_context('display.max_colwidth', None):
    display(df.head())
df.info()

Unnamed: 0,text,subject
0,Q: How do you send a payload to the James Webb Space Telescope?A: ${jwst:ldap://,Jwst-mission
1,Why NASA's James Webb telescope Twitter account blocked other NASA Twitter accounts [,Jwst-facts
2,"With Webb’s Mid-Booms Extended, Sunshield Takes Shape – James Webb Space Telescope Sunshield STARBOARD Mid-BoomThe Right/Starboard (-J2) Sunshield Boom Deployment",Jwst-mission
3,The first image from the new James Webb telescope is in!,Jwst-mission
4,Still can't believe we did the space shuttle. Nation of absolute madmen.,Other-missions


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1348 entries, 0 to 1347
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   text     1348 non-null   object
 1   subject  1348 non-null   object
dtypes: object(2)
memory usage: 21.2+ KB
CPU times: user 209 ms, sys: 31.9 ms, total: 241 ms
Wall time: 343 ms


## Exploratory Data Analysis

Show distribution of tweet subjects in annotated data

In [6]:
df["subject"].value_counts().rename("count").to_frame().merge(
    (100 * df["subject"].value_counts(normalize=True).rename("fraction").to_frame()),
    left_index=True,
    right_index=True,
    how="left",
).rename_axis("subject").reset_index()

Unnamed: 0,subject,count,fraction
0,misc,370,27.448071
1,Jwst-mission,287,21.290801
2,Other-missions,220,16.320475
3,Nasa-science,127,9.421365
4,images,112,8.308605
5,Jwst-facts,97,7.195846
6,Nasa-careers,45,3.338279
7,Nasa-funding,40,2.967359
8,telescopes,26,1.928783
9,Rover-images,24,1.780415
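
The same kind of summary could also be assembled with `pd.concat` instead of a merge. This is only a sketch on made-up labels, with hypothetical names (`example_subjects`, `summary`). The column is called `percent` here because the values are multiplied by 100, as they are in the `fraction` column above.

```python
import pandas as pd

# Made-up labels for illustration only
example_subjects = pd.Series(["a", "a", "b", "c"], name="subject")

# Absolute counts per subject
counts = example_subjects.value_counts().rename("count")

# Share of each subject, expressed as a percentage
percent = (100 * example_subjects.value_counts(normalize=True)).rename("percent")

# Align both Series on their shared index (the subject labels)
summary = pd.concat([counts, percent], axis=1).rename_axis("subject").reset_index()
print(summary)
```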


**Notes**
1. During manual labeling, subjects unrelated to the current project were given names that do not contain *Jwst*.
2. Some tweets labeled *telescopes* might involve discussion related to the current project, but this was not always found to be the case. Additionally, tweets belonging to this subject make up less than 2% of all manually labeled tweets (26 of 1,348), so we will discard them from further use.

**Observations**
1. The subjects required for this project are
   - *Jwst-mission*
   - *Jwst-facts*

   so we will focus on building a list of search terms that efficiently captures tweets belonging to these two subjects only. This list will not be suitable for use with other subjects, but this is acceptable since this project is focused on the above two subjects only.

## Filter Data

Filter tweets to get required subjects

In [7]:
df_subjects = df.query("subject.isin(@wanted_subject_strs)")

From the filtered data, get tweets that do not contain any of the wanted
- standalone, or
- joined

terms in the text of the tweet

In [8]:
df_unexpected = df_subjects.query("~text.str.lower().str.contains(@wanted_str)")
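
The two `query` calls above can also be written with plain boolean masks, which makes the negation explicit. This is a minimal sketch on made-up rows. The names `toy`, `toy_subjects` and `toy_pattern` are hypothetical and the pattern is deliberately shorter than `wanted_str`.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame(
    {
        "text": [
            "JWST sunshield tensioning complete",
            "You go NASA",
            "Perseverance rover sends new images",
        ],
        "subject": ["Jwst-mission", "Jwst-mission", "Rover-images"],
    }
)
toy_subjects = ["Jwst-mission", "Jwst-facts"]
toy_pattern = "jwst|sunshield"

# Step 1: keep only rows labeled with a wanted subject
toy_wanted = toy[toy["subject"].isin(toy_subjects)]

# Step 2: keep rows whose lowercased text matches none of the terms
has_term = toy_wanted["text"].str.lower().str.contains(toy_pattern)
toy_leftover = toy_wanted[~has_term]
print(toy_leftover)
```

Only the second row remains: it has a wanted subject but contains none of the terms.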

Show the leftover tweets that are not captured by the filter term list defined above (`wanted_list`)

In [9]:
print(
    f"Found {len(df_unexpected):,} leftover tweets "
    f"({100*len(df_unexpected)/len(df_subjects):.3f}%) "
    f"out of {len(df_subjects):,} candidate tweets matching the wanted subjects"
)
with pd.option_context("display.max_rows", None, "display.max_colwidth", None):
    display(df_unexpected)

Found 30 leftover tweets (7.812%) out of 384 candidate tweets matching the wanted subjects


Unnamed: 0,text,subject
113,Can’t wait to see this thing in action: NASA’s finest moment since saving Hubble.,Jwst-mission
146,"I am not with NASA so this is just my opinion: It all depends where the tear would be, I imagine that if it is towards the outside of the shield( away from the center) its ok, also, remember that the main sensors are actively cooled (new tech refrigeration)!",Jwst-mission
159,"Might be a little gassy but hey, you go NASA.",Jwst-mission
188,Interesting Fact I learned the other day. The space shuttle was the best looking craft to ever fly.,Jwst-facts
265,Educators - subscribe to NASA Explore newsletter and get a toolkit for bringing the mission into the classroom.,Jwst-mission
277,"While some of us were celebrating Christmas, the scientists at were busy launching the successfully!!!The fine people at NASA have created something to help you follow along as it deploys its many and various parts!",Jwst-mission
283,ICYMI: NASA &amp; CDSE member teams placed the III forward skirt on the Michoud Assembly Facility Vertical Assembly Center robotic weld tool for its next production phase. The forward skirt will host flight computers &amp; avionics:,Jwst-mission
294,part 3: around the spacecraft and then amplify and distribute said energy to thrusters as the craft begins to gain more and more speed due to an infinite amount of energy needed to reach the speed of light. Nuclear fission being used to do that is too risky i would say.,Jwst-facts
320,Then why is there a can spinning in this fake CGI NASA video? Scroll 4mins in. I’ll wait…. Next.,Jwst-mission
349,"I give up. It looks like a space shuttle, how can that be new?",Jwst-facts


**Observations**
1. The search term list misses fewer than 10% of candidate tweets (30 out of 384, or about 7.8%).
2. Some of the leftover tweets are directly related to the current project. However, expanding the search term list (`wanted_list`) to capture these leftover tweets would also capture unrelated tweets. For example, including `nasa` would capture most of the leftover tweets, but it would also pick up tweets about other NASA missions that are not relevant to this project.
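
One way to check this trade-off before adding a term is to count how many wanted and unwanted tweets a candidate term list matches. The helper `count_matches` below is a hypothetical sketch, not part of the notebook, and the rows are made up to mirror the situation described above.

```python
import pandas as pd


def count_matches(df, terms, wanted_subjects):
    """Count wanted and unwanted rows whose text matches any of the terms."""
    pattern = "|".join(terms)
    matched = df["text"].str.lower().str.contains(pattern)
    wanted = df["subject"].isin(wanted_subjects)
    return {
        "wanted_matched": int((matched & wanted).sum()),
        "unwanted_matched": int((matched & ~wanted).sum()),
    }


# Made-up rows for illustration only
toy = pd.DataFrame(
    {
        "text": [
            "Webb sunshield deployed",
            "You go NASA",
            "NASA rover lands on Mars",
            "NASA is hiring interns",
        ],
        "subject": [
            "Jwst-mission",
            "Jwst-mission",
            "Other-missions",
            "Nasa-careers",
        ],
    }
)
subjects = ["Jwst-mission", "Jwst-facts"]

# Compare the matches with and without the broad term "nasa"
before = count_matches(toy, ["webb", "sunshield"], subjects)
after = count_matches(toy, ["webb", "sunshield", "nasa"], subjects)
print(before)
print(after)
```

In this toy data, adding `nasa` recovers the one missed wanted tweet but also matches both unwanted tweets.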