# Generating tf-idf values for workshop descriptions

Takes advantage of Melanie Walsh's great work here: https://melaniewalsh.github.io/Intro-Cultural-Analytics/Text-Analysis/TF-IDF-Scikit-Learn.html



TODOS:
- Remove numbers
- Drop to just unigrams and bigrams?
- Add custom stopwords to remove standard text in descriptions that is about the physical libraries, Ask Us, etc.
  - Can we remove chunks of text before we even get to the vectorizer?
- What are we doing with tf-idf values?
  - Try to match the tf-idf values with queries (queries could be from user or keywords from catalog/metadata)
- How often should the vector space be updated? For each new query, or can we settle on a standard set of features that we expect to hold for some time and just project new items into it?
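
The first three TODOs could be handled when the vectorizer is configured. The following is a minimal sketch on a toy corpus, not the notebook's actual setup: the boilerplate patterns and the extra stopwords are hypothetical placeholders that would need to be replaced after reading the real descriptions. Note that passing a `preprocessor` replaces scikit-learn's built-in lowercasing step, so the sketch lowercases explicitly.

```python
import re

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Hypothetical boilerplate phrases; the real list would come from reading the descriptions.
BOILERPLATE_PATTERNS = [
    r"ask us",
    r"d\.\s*h\.\s*hill jr\.? library",
]
boilerplate_re = re.compile("|".join(BOILERPLATE_PATTERNS), flags=re.IGNORECASE)


def strip_boilerplate(text):
    """Remove known boilerplate phrases before the text reaches the tokenizer."""
    return boilerplate_re.sub(" ", text)


# Hypothetical extra stopwords, added to scikit-learn's built-in English list.
custom_stop_words = list(ENGLISH_STOP_WORDS.union({"workshop", "registration"}))

vectorizer = TfidfVectorizer(
    # A custom preprocessor overrides the default lowercasing, so lowercase here.
    preprocessor=lambda text: strip_boilerplate(text).lower(),
    stop_words=custom_stop_words,
    ngram_range=(1, 2),  # unigrams and bigrams only
    # Tokens of two or more letters; anything containing a digit is skipped.
    token_pattern=r"(?u)\b[^\W\d_][^\W\d_]+\b",
)

# Toy descriptions written for this sketch (not rows from the CSV).
toy_corpus = [
    "Meet at the D. H. Hill Jr. Library at 10:00 to learn Python.",
    "Ask Us about the data visualization workshop on 02-04-2019.",
]
X_toy = vectorizer.fit_transform(toy_corpus)
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
```

Stopwords are removed before n-grams are built, so a bigram can span a position where a stopword used to be.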

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
df = pd.read_csv("all-workshops-2021-02-04.csv")

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,title,nid,field_time_d8,body,field_registration_url,field_non_library_instructor,field_workshop_leads_export,field_workshop_series,field_workshop_user_activities,field_non_libraries_space_1,field_space
0,0,Orientation: Digital Media Making in the Libra...,53062,02-04-2019 11:00AM to 02-04-2019 11:45AM,"Excited to make videos and movies, record podc...","<a href=""https://reporter.ncsu.edu/link/instan...",,"[{'id': '504', 'url': 'https://www.lib.ncsu.ed...",Digital Media,,,"<a href=""/spaces/digital-media-lab"" hreflang=""..."
1,1,"MATLAByrinth Part 2: Getting Started, Part Two!",52895,02-04-2019 2:00PM to 02-04-2019 4:00PM,"MATLAByrinth, is a series of four comprehensiv...","<a href=""https://reporter.ncsu.edu/link/instan...","\nAmrutha Raghu, AmruthaRaghu.jpg\n\n",[],,,,"<a href=""/spaces/teaching-and-visualization-la..."
2,2,Orientation: Digital Media Making in the Libra...,53194,02-04-2019 4:30PM to 02-04-2019 5:00PM,"Excited to make videos and movies, record podc...","<a href=""https://reporter.ncsu.edu/link/instan...",,[],Digital Media,,,"<a href=""/spaces/4k-video-studio"" hreflang=""un..."
3,3,Orientation: Digital Media Making in the Libra...,53199,02-04-2019 4:30PM to 02-04-2019 5:00PM,"Excited to make videos and movies, record podc...","<a href=""https://reporter.ncsu.edu/link/instan...",,[],Digital Media,,,"<a href=""/spaces/digital-media-lab"" hreflang=""..."
4,4,Virtual Reality Studio Orientation,53134,02-04-2019 6:00PM to 02-04-2019 7:00PM,This orientation is required to access the D. ...,"<a href=""https://reporter.ncsu.edu/link/instan...",\nAnthony Chaanine\n\n,[],Virtual and Augmented Reality,"Virtual Reality &amp; Augmented Reality, Virtu...",,"<a href=""/spaces/vr-studio"" hreflang=""und"">VR ..."


In [4]:
corpus = df["body"].to_list()

In [5]:
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 3))

In [6]:
X = vectorizer.fit_transform(corpus)

In [7]:
X.shape

(1300, 29595)
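
One way to approach the query-matching TODO is to fit the vector space once and then project each query into it with `transform`, ranking documents by cosine similarity. This is a sketch on a toy corpus; `rank_workshops`, `toy_titles`, and `toy_corpus` are hypothetical names, and cosine similarity is an assumed choice of ranking measure rather than something the notebook has settled on.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for workshop titles and descriptions (not taken from the real CSV).
toy_titles = ["Python basics", "Visualization design", "Data management plans"]
toy_corpus = [
    "Learn to clean data with Python and pandas.",
    "Principles of visualization design for charts and graphs.",
    "Write a data management plan for your research project.",
]

toy_vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 3))
X_toy = toy_vectorizer.fit_transform(toy_corpus)  # fit the vector space once


def rank_workshops(query, top_n=2):
    """Project a query into the fitted space and rank documents by cosine similarity."""
    # transform() reuses the fitted vocabulary and idf weights; nothing is refit.
    query_vec = toy_vectorizer.transform([query])
    scores = cosine_similarity(query_vec, X_toy).ravel()
    order = scores.argsort()[::-1][:top_n]
    return [(toy_titles[i], float(scores[i])) for i in order]


results = rank_workshops("python data cleaning")
print(results)
```

Query terms that are not in the fitted vocabulary are silently ignored by `transform`, which is the main cost of keeping a fixed feature set instead of refitting.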

In [13]:
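# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (1.0+) replaces it
tfidf_df = pd.DataFrame(X.toarray(), index=df["title"], columns=vectorizer.get_feature_names_out())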

In [19]:
tfidf_df.head()

title,00,00 00pm,00 00pm lunch,00 coffee,00 coffee breakfast,00 deep,00 deep dive,00 lightning,00 lightning talks,00 need,...,zoom situation,zoom situation covid,zoom submitted,zoom submitted invited,zoom webconference,zoom webconference link,zoom webconferencing,zoom webconferencing catch,zoom workshop,zoom workshop intended
Orientation: Digital Media Making in the Libraries,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"MATLAByrinth Part 2: Getting Started, Part Two!",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Orientation: Digital Media Making in the Libraries,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Orientation: Digital Media Making in the Libraries,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Virtual Reality Studio Orientation,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [14]:
# Count the documents in which each term has a non-zero weight, stored as an extra row
tfidf_df.loc["document_frequency"] = (tfidf_df > 0).sum()

In [15]:
tfidf_df.loc["document_frequency"]

00                            4.0
00 00pm                       1.0
00 00pm lunch                 1.0
00 coffee                     1.0
00 coffee breakfast           1.0
                             ... 
zoom webconference link       1.0
zoom webconferencing          1.0
zoom webconferencing catch    1.0
zoom workshop                 1.0
zoom workshop intended        1.0
Name: document_frequency, Length: 29595, dtype: float64

In [17]:
tfidf_df.loc["document_frequency"].sort_values()

zoom workshop intended                1.0
knowledge librarian                   1.0
knowledge librarian registration      1.0
techniques deploy machine             1.0
techniques deploy                     1.0
                                    ...  
required                            678.0
learn                               720.0
hill jr                             735.0
jr                                  736.0
hill                                769.0
Name: document_frequency, Length: 29595, dtype: float64

In [18]:
# Remove the helper row again so it is not treated as a workshop
tfidf_df = tfidf_df.drop("document_frequency")

In [20]:
tfidf_df.stack().reset_index()

Unnamed: 0,title,level_1,0
0,Orientation: Digital Media Making in the Libra...,00,0.0
1,Orientation: Digital Media Making in the Libra...,00 00pm,0.0
2,Orientation: Digital Media Making in the Libra...,00 00pm lunch,0.0
3,Orientation: Digital Media Making in the Libra...,00 coffee,0.0
4,Orientation: Digital Media Making in the Libra...,00 coffee breakfast,0.0
...,...,...,...
38473495,IRB Basics: eIRB Application Workshop,zoom webconference link,0.0
38473496,IRB Basics: eIRB Application Workshop,zoom webconferencing,0.0
38473497,IRB Basics: eIRB Application Workshop,zoom webconferencing catch,0.0
38473498,IRB Basics: eIRB Application Workshop,zoom workshop,0.0


In [21]:
tfidf_df_stacked = tfidf_df.stack().reset_index() 

In [25]:
tfidf_df_stacked.columns

Index(['title', 'term', 0], dtype='object')

In [27]:
tfidf_df_stacked.columns = ["title", "term", "tfidf"]
tfidf_df_stacked.head()

Unnamed: 0,title,term,tfidf
0,Orientation: Digital Media Making in the Libra...,00,0.0
1,Orientation: Digital Media Making in the Libra...,00 00pm,0.0
2,Orientation: Digital Media Making in the Libra...,00 00pm lunch,0.0
3,Orientation: Digital Media Making in the Libra...,00 coffee,0.0
4,Orientation: Digital Media Making in the Libra...,00 coffee breakfast,0.0


In [28]:
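# Top 10 terms per title: sort by title, then by descending tf-idf, and keep the first 10 rows of each group
tfidf_df_stacked.sort_values(by=['title', 'tfidf'], ascending=[True, False]).groupby(["title"]).head(10)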

Unnamed: 0,title,term,tfidf
29630556,Introduction to Data Management Plans for Res...,data management,0.269877
29639767,Introduction to Data Management Plans for Res...,management,0.217663
29630399,Introduction to Data Management Plans for Res...,data,0.196784
29630560,Introduction to Data Management Plans for Res...,data management plan,0.149959
29652880,Introduction to Data Management Plans for Res...,water,0.145300
...,...,...,...
10822929,"You buy a pair of shoes online, you get an adv...",recommendation,0.100177
10822888,"You buy a pair of shoes online, you get an adv...",recent,0.094190
10829505,"You buy a pair of shoes online, you get an adv...",user,0.065110
10822456,"You buy a pair of shoes online, you get an adv...",python,0.062591


In [29]:
top_tfidf = tfidf_df_stacked.sort_values(by=['title', 'tfidf'], ascending=[True, False]).groupby(["title"]).head(10)

In [30]:
top_tfidf.to_csv("workshops_tfidf.csv")

In [32]:
top_tfidf[top_tfidf["term"].str.contains("python")]

Unnamed: 0,title,term,tfidf
493801,Data Cleaning with Python,python,0.162281
3009376,Data Cleaning with Python,python,0.162281
32367616,Data Manipulation with Python,python,0.134526
32545186,Data Visualization with Python,python,0.132479
32604376,Data Visualization with Python,python,0.132479
34853596,Data Visualization with Python,python,0.132479
12331801,Introduction to Computer Vision and Image Proc...,python,0.16547
12312614,Introduction to Computer Vision and Image Proc...,analysis experience python,0.097297
32278831,Introduction to Programming with Python,python,0.126403
34172911,Introduction to Programming with Python,python,0.126403


In [33]:
top_tfidf[top_tfidf["term"].str.contains("visualization")]

Unnamed: 0,title,term,tfidf
13316180,Data Visualization with R,visualizations,0.154358
18880040,Data Visualization with R,visualizations,0.154358
24177545,Data Visualization with R,visualizations,0.154358
30718040,Data Visualization with R,visualizations,0.154358
827047,Elements of Visualization Design,visualization,0.175923
17903362,Elements of Visualization Design,visualization,0.175923
21928282,Elements of Visualization Design,visualization,0.175923
28883107,Elements of Visualization Design,visualization,0.175923
36074692,Elements of Visualization Design,visualization,0.175923
3520192,Graphing in Excel,visualization,0.140626


In [35]:
# Note that title matching is case-sensitive: titles keep their original casing, while terms were lowercased by the vectorizer
top_tfidf[top_tfidf["title"].str.contains("Python")]

Unnamed: 0,title,term,tfidf
493801,Data Cleaning with Python,python,0.162281
3009376,Data Cleaning with Python,python,0.162281
476491,Data Cleaning with Python,bring,0.148551
2992066,Data Cleaning with Python,bring,0.148551
475320,Data Cleaning with Python,askus desk bring,0.135371
...,...,...,...
35017947,Webscraping with Python,discuss legal ethical,0.098065
35019320,Webscraping with Python,ethical issue,0.098065
35019321,Webscraping with Python,ethical issue involved,0.098065
35020171,Webscraping with Python,file ll,0.098065
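
If case-insensitive title matching is wanted, `str.contains` accepts `case=False`. The sketch below also drops the duplicate rows that appear when the same workshop is offered several times, as in the `python` results above. The rows in `toy_top_tfidf` are made up for illustration and only mimic the shape of `top_tfidf`.

```python
import pandas as pd

# Toy rows shaped like top_tfidf (titles and values are made up for illustration).
toy_top_tfidf = pd.DataFrame(
    {
        "title": [
            "Data Cleaning with Python",
            "Data Cleaning with Python",
            "python for beginners",
            "Graphing in Excel",
        ],
        "term": ["python", "python", "loops", "visualization"],
        "tfidf": [0.16, 0.16, 0.12, 0.14],
    }
)

# case=False matches both "Python" and "python" in the title.
matches = toy_top_tfidf[toy_top_tfidf["title"].str.contains("python", case=False)]

# Repeated offerings of one workshop yield identical rows; keep one of each.
matches = matches.drop_duplicates(subset=["title", "term"])
print(matches)
```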
