In [1]:
#https://rise.readthedocs.io/en/stable/usage.html

# Breakout Session Goals
In this breakout session we will analyze our own Google Search history Takeouts to learn 1) what this data source could potentially tell us and 2) what practical challenges arise when gathering such data.

# Personal data downloads / takeouts
- Came about due to [GDPR](https://gdpr.eu/what-is-gdpr/)
- Allows researchers to partner with individuals - data donation approach
- Still subject to the data formatting of platforms which could change any time
- Can be confusing for participants to accomplish
- Can be hard for participants to see what data they are donating (and thus to provide *informed* consent)

# Personal data downloads / takeouts
- HUGE advantage over other approaches: device agnostic. Data are collected anywhere one is logged into the platform
- Historical approach avoids [Hawthorne](https://en.wikipedia.org/wiki/Hawthorne_effect) (observation) effects
- Single platform: usefulness depends on how much individuals concentrate their activity on one platform. Google Search is widely used and [the search engine market overall is highly concentrated](https://gs.statcounter.com/search-engine-market-share)

# Introductions
- Depending on what introductions we have already done in the main session: 
    - Name
    - Title
    - University
    - Your interest in takeouts/search traces
    - Any prior experience with related approaches or data?

# How to download your search history from Google
- Go to [takeout.google.com](https://takeout.google.com)
- Click the "Deselect all" button (so you are not downloading everything, which would take a long time!)
- Check "My Activity"
- Click "multiple formats" and change from HTML to JSON and click OK
- At the bottom of the page click "Next Step"
- Click "Create export"

# How to download your search history from Google

- The Export Progress message will say it could take hours or days to complete, but it will only take a few minutes.
- You can view your export in progress at [https://takeout.google.com/takeout/downloads](https://takeout.google.com/takeout/downloads). It will soon appear there as available for download. You may need to refresh the page.
- Click Download

# How to follow along in Python
- If you already know how to clone a GitHub repository and open an ipynb file, here is the repository link: https://github.com/erickaakcire/explore_google_takeout
- If not, I will show you how to open it in Google Colab (no need to install Python, etc.). **Everyone can follow along!**

1. Log in to Google and go to https://drive.google.com/
2. Create a new Google Colab document by going to New > More > Google Colaboratory
    - If you do not see a "Google Colaboratory" option, go to "Connect more apps" and choose Google Colaboratory
    - Once you have done this, you have created a new blank ipynb Python file on Google's servers
3. Now we need to import the ipynb file into your new Google Colab document. Go to File > Upload Notebook > click the GitHub tab. Enter this GitHub URL: https://github.com/erickaakcire/explore_google_takeout

# Load your json file into Pandas
- If you are working locally, just change the path to your own
- If you are working on Google Colab, first upload the MyActivity.json file that you find in the Search folder to Google Colab. Click on the folder icon on the left, which will open up a file view where you can drag your file. To get the path of this file, click the three dots menu on the right of the file and choose "copy path" then just paste what you have inside the quotation marks in the cell below.
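
Before loading the full export, it can help to check what a single record looks like. The sketch below uses only the standard library and a made-up two-record sample. The field names mirror the columns shown in this notebook, but the values are invented for illustration. It shows that the file is one JSON array with one object per activity record.

```python
import json

# Hypothetical sample mimicking the structure of MyActivity.json;
# field names follow the columns in this notebook, values are invented.
sample = """
[
  {"header": "Search", "title": "Searched for hello world in r",
   "titleUrl": "https://www.google.com/search?q=hello+world+in+r",
   "time": "2022-05-18T22:58:12.888Z",
   "products": ["Search"], "activityControls": ["Web & App Activity"]},
  {"header": "Search", "title": "Visited Example Page",
   "titleUrl": "https://www.example.com",
   "time": "2022-05-18T22:59:00.000Z",
   "products": ["Search"], "activityControls": ["Web & App Activity"]}
]
"""

records = json.loads(sample)       # the export is one JSON array
field_names = sorted(records[0])   # keys of the first record
print(len(records), "records")
print(field_names)
```

With a real export you would open the file instead, for example `with open(path, encoding="utf-8") as f: records = json.load(f)`, where `path` is the path you copied above.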

In [2]:
import pandas as pd
df = pd.read_json("/Users/emt/Downloads/Takeout 6/My Activity/Search/MyActivity.json")
df.head()

Unnamed: 0,header,title,titleUrl,time,products,activityControls,locationInfos,details,subtitles
0,Search,Visited How to Add R to Jupyter Notebook ? - G...,https://www.google.com/url?q=https://www.geeks...,2022-05-18T23:13:57.036Z,[Search],[Web & App Activity],,,
1,Search,Visited dyld : Library not loaded: Reason: ima...,https://www.google.com/url?q=https://www.biost...,2022-05-18T23:09:28.840Z,[Search],[Web & App Activity],,,
2,Search,Searched for dyld: Library not loaded: @rpath/...,https://www.google.com/search?q=dyld:+Library+...,2022-05-18T23:09:14.320Z,[Search],[Web & App Activity],"[{'name': 'At this general area', 'url': 'http...",,
3,Search,Visited Jupyter Notebook : HTTP 404: Not Found...,https://www.google.com/url?q=https://github.co...,2022-05-18T23:03:00.510Z,[Search],[Web & App Activity],,,
4,Search,Searched for jupyter notebook r kernel not sta...,https://www.google.com/search?q=jupyter+notebo...,2022-05-18T23:02:56.912Z,[Search],[Web & App Activity],"[{'name': 'At this general area', 'url': 'http...",,


# Goal

## Display frequent search terms (unigrams) (overall, by month)

1. "searched for" and "visited" record types must be distinguished
2. Extract the month/year as a variable
3. Lowercase the title field, take out punctuation, and split by space to create a clean search words array
4. Remove stop words for your language(s)
5. Group by month and view the top X words per month
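
To make the steps above concrete, here is a minimal plain-Python sketch of steps 1, 3 and 4 on three invented titles. The helper name `clean_terms` and the tiny stop-word set are assumptions made for illustration. This is not the notebook's own implementation, which uses pandas.

```python
import string
from collections import Counter

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"in", "to", "the", "a", "for", "not"}
PREFIX = "Searched for "

def clean_terms(title):
    """Return cleaned search words, or None if the record is not a search."""
    if not title.startswith(PREFIX):
        return None  # e.g. a "Visited ..." record
    text = title[len(PREFIX):].lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [w for w in text.split() if w not in STOP_WORDS]

titles = [
    "Searched for hello world in r",
    "Visited Google Search",
    "Searched for add r to jupyter notebook",
]

counts = Counter()
for t in titles:
    words = clean_terms(t)
    if words is not None:
        counts.update(words)

print(counts.most_common(3))
```

Returning `None` for non-search records keeps the "searched for" versus "visited" distinction explicit rather than silently producing empty word lists.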

# First, general exploration

In [6]:
df.tail()

Unnamed: 0,header,title,titleUrl,time,products,activityControls,locationInfos,details,subtitles
50219,Search,Visited http://sinosphere.blogs.nytimes.com/20...,https://www.google.com/url?q=http://sinosphere...,2014-04-03T08:20:26.348Z,[Search],[Web & App Activity],,,
50220,Search,Visited http://en.wikipedia.org/wiki/Bo_Xilai,https://www.google.com/url?q=http://en.wikiped...,2014-04-03T08:20:22.179Z,[Search],[Web & App Activity],,,
50221,Search,Visited http://www.bbc.com/news/world-asia-chi...,https://www.google.com/url?q=http://www.bbc.co...,2014-04-03T08:20:20.328Z,[Search],[Web & App Activity],,,
50222,Search,Searched for bo xilai,https://www.google.com/search?q=bo+xilai,2014-04-03T08:20:11.424Z,[Search],[Web & App Activity],,,
50223,Search,Searched for 10 megabytes in bytes,https://www.google.com/search?q=10+megabytes+i...,2014-04-02T14:39:42.060Z,[Search],[Web & App Activity],,,


In [2]:
df.shape

(50224, 9)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50224 entries, 0 to 50223
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   header            50224 non-null  object
 1   title             50224 non-null  object
 2   titleUrl          50019 non-null  object
 3   time              50224 non-null  object
 4   products          50224 non-null  object
 5   activityControls  50224 non-null  object
 6   locationInfos     19892 non-null  object
 7   details           71 non-null     object
 8   subtitles         11 non-null     object
dtypes: object(9)
memory usage: 3.4+ MB


# 'time' is not a timestamp!

In [6]:
df['time'] = pd.to_datetime(df['time'], infer_datetime_format=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50224 entries, 0 to 50223
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype              
---  ------            --------------  -----              
 0   header            50224 non-null  object             
 1   title             50224 non-null  object             
 2   titleUrl          50019 non-null  object             
 3   time              50224 non-null  datetime64[ns, UTC]
 4   products          50224 non-null  object             
 5   activityControls  50224 non-null  object             
 6   locationInfos     19892 non-null  object             
 7   details           71 non-null     object             
 8   subtitles         11 non-null     object             
dtypes: datetime64[ns, UTC](1), object(8)
memory usage: 3.4+ MB


In [3]:
df.describe()

Unnamed: 0,header,title,titleUrl,time,products,activityControls,locationInfos,details,subtitles
count,50224,50224,50019,50224,50224,50224,19892,71,11
unique,1,42066,42118,49818,1,1,1363,39,11
top,Search,Visited Google Search,https://www.google.com,2021-03-24T01:58:44.184Z,[Search],[Web & App Activity],"[{'name': 'At this general area', 'url': 'http...",[{'name': 'Referred from timeanddate.com'}],"[{'name': 'Including topics:'}, {'name': 'Dona..."
freq,50224,334,334,3,50224,50224,5798,6,1


# Three columns have only 1 value

In [7]:
# Keep only the columns we need; .copy() avoids SettingWithCopyWarning later
df = df[['title','time']].copy()
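
Constant columns can also be found programmatically rather than by reading the `describe()` table. The sketch below runs on an invented three-row frame. It assumes, as in the JSON export, that some columns hold lists, which are unhashable, so values are compared by their string representation.

```python
import pandas as pd

# Invented toy frame; column names follow the export, values are made up.
toy = pd.DataFrame({
    "header": ["Search", "Search", "Search"],
    "products": [["Search"], ["Search"], ["Search"]],
    "title": ["Searched for a", "Visited b", "Searched for c"],
})

# Lists are unhashable, so count unique string representations instead.
n_unique = toy.astype(str).nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()
print(constant_cols)
```

Columns that come back in `constant_cols` carry no information for this analysis and can be dropped.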

In [8]:
df['year'] = df['time'].dt.year
df['month'] = df['time'].dt.month
df.head()

Unnamed: 0,title,time,year,month
0,Visited How to Add R to Jupyter Notebook ? - G...,2022-05-18 23:13:57.036000+00:00,2022,5
1,Visited dyld : Library not loaded: Reason: ima...,2022-05-18 23:09:28.840000+00:00,2022,5
2,Searched for dyld: Library not loaded: @rpath/...,2022-05-18 23:09:14.320000+00:00,2022,5
3,Visited Jupyter Notebook : HTTP 404: Not Found...,2022-05-18 23:03:00.510000+00:00,2022,5
4,Searched for jupyter notebook r kernel not sta...,2022-05-18 23:02:56.912000+00:00,2022,5
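
Separate `year` and `month` columns work well for grouping. If a single label per month is more convenient (for plots or table headers, say), one option is a `YYYY-MM` string. The sketch below is an optional alternative, not something the rest of this notebook depends on. The two timestamps are copied from the outputs above.

```python
import pandas as pd

times = pd.to_datetime(pd.Series([
    "2022-05-18T23:13:57.036Z",
    "2014-04-03T08:20:11.424Z",
]), utc=True)

# Format each timestamp as a sortable year-month label.
year_month = times.dt.strftime("%Y-%m")
print(year_month.tolist())
```

Note that the timestamps are in UTC, so activity close to midnight at a month boundary may land in a different month than it did in local time.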


# We just need searches, not visits

In [11]:
df['search terms'] = df['title'].str.extract(r'Searched for (.*)', expand=False)

In [25]:
# Keep only the rows where a search term was extracted; .copy() avoids SettingWithCopyWarning
df = df.loc[df['search terms'].notnull()].copy()
df.head()

Unnamed: 0,title,time,year,month,search terms
2,Searched for dyld: Library not loaded: @rpath/...,2022-05-18 23:09:14.320000+00:00,2022,5,dyld: Library not loaded: @rpath/libreadline.6...
4,Searched for jupyter notebook r kernel not sta...,2022-05-18 23:02:56.912000+00:00,2022,5,jupyter notebook r kernel not starting Error o...
6,Searched for jupyter notebook r kernel not sta...,2022-05-18 23:01:41.527000+00:00,2022,5,jupyter notebook r kernel not starting
8,Searched for hello world in r,2022-05-18 22:58:12.888000+00:00,2022,5,hello world in r
10,Searched for add r to jupyter notebook,2022-05-18 22:51:09.599000+00:00,2022,5,add r to jupyter notebook


# Cleaning

In [26]:
df['search terms'] = df['search terms'].str.lower()

In [29]:
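import string

def strip_punctuation(txt):
    # Operates on one string; remove every punctuation character
    return txt.translate(str.maketrans('', '', string.punctuation))

# Apply per row (a Series has no .strip); keep the result in a new column
df['clean terms'] = df['search terms'].apply(strip_punctuation)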

In [28]:
df['search terms'].str.split(' ', expand=False)

2        [dyld:, library, not, loaded:, @rpath/libreadl...
4        [jupyter, notebook, r, kernel, not, starting, ...
6            [jupyter, notebook, r, kernel, not, starting]
8                                    [hello, world, in, r]
10                         [add, r, to, jupyter, notebook]
                               ...                        
50215                                 [20, kg, in, poinds]
50217                           [erasmus, university, map]
50218                        [chicco, -, lite, way, buggy]
50222                                          [bo, xilai]
50223                           [10, megabytes, in, bytes]
Name: search terms, Length: 26948, dtype: object
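
The remaining steps of the goal (removing stop words, then viewing the top words per month) could be sketched as follows. This runs on an invented toy frame rather than the real data. The column name `words`, the stop-word set and `top_n` are assumptions chosen for illustration.

```python
import pandas as pd

# Invented toy data standing in for the split search-term lists.
toy = pd.DataFrame({
    "year": [2022, 2022, 2022, 2014],
    "month": [5, 5, 5, 4],
    "words": [
        ["jupyter", "notebook", "r"],
        ["hello", "world", "in", "r"],
        ["add", "r", "to", "jupyter"],
        ["bo", "xilai"],
    ],
})

stop_words = {"in", "to", "not"}  # illustrative only
top_n = 2

# One row per word, then drop stop words.
word_rows = toy.explode("words")
word_rows = word_rows[~word_rows["words"].isin(stop_words)]

# Count each word within each (year, month); sort so the most frequent
# words come first, breaking ties alphabetically for a stable result.
counts = (
    word_rows.groupby(["year", "month", "words"])
    .size()
    .reset_index(name="n")
    .sort_values(["year", "month", "n", "words"],
                 ascending=[True, True, False, True])
)

# Keep the first top_n rows of each month.
top_words = counts.groupby(["year", "month"]).head(top_n)
print(top_words)
```

`explode` turns each list of words into one row per word, which lets an ordinary `groupby` do the counting. For real data, a fuller stop-word list for your language(s) would replace the three-word set used here.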