# CMN512V - Social Science with Online Data 

-------

## Module 4 Demo: Web History

## Intro

You may have heard that tech companies use our behavior online to deliver targeted ads, or that the government can demand access to this data, but what can the data that big tech companies have about us really reveal?

![](https://drive.google.com/file/d/1j4136RMOVolt6Vsn4IZLKjhyJAFaBaSv/view?usp=sharing)

<img src="https://raw.githubusercontent.com/erickaakcire/wh_course_activity/main/confess-to-murder.png" alt="json" width="500"/>

More seriously, the recent repeal of the constitutional right to abortion in the United States, and the subsequent abortion restrictions and bans in some states, prompted speculation that [search history could be used as evidence in court](https://www.washingtonpost.com/technology/2022/07/01/google-privacy-abortion/) to prove that an illegal abortion has taken place. The author of this opinion piece argues that companies like Google should keep less information about all users to protect people from these kinds of dangers.

In this demo you will look at what [Google has predicted about you and your interests](https://adssettings.google.com/) based on your search history and all of the data it has about you, and you will examine your own search history by accessing data that is already on your computer.

## Our Strategy

You have been asked to use the Chrome web browser on your desktop computer over the past few weeks. This demo depends on you having done so. We will use an open source Chrome extension to show you your overall browsing data, and to provide you with a `.json` file to analyze in this demo. This extension does not collect any of your data; it just enables your browser to show you the data it has locally. For more background about the extension see [its website](http://webhistorian.org).

### Gathering data

Install the [Web Historian Educational Edition Chrome extension](https://chrome.google.com/webstore/detail/web-historian-education-e/chpcblajbmmlbhecpnnadmjmlbhkloji). Feel free to explore the visualizations (particularly the Search Terms), but for this demo you just need to click the "Download JSON" button on the front page. This will give you a file called `web_historian_data.json`.

If you have not been able to use Chrome on a desktop, or if you want to use demo data, just download
<a href="https://raw.githubusercontent.com/erickaakcire/wh_course_activity/main/web_historian_data_demo.json" download>this `web_historian_data_demo.json` file</a>. If you see a text file when you click on this link, just go to `File > Save page as` to save the `.json` file.

### Your privacy in this class
This notebook is written in a way that respects your privacy.  Only you will see the analysis of your data, and when you upload your file to this notebook, the contents of that file will be forgotten when you download or submit the notebook, when you refresh the page, or when you click `"Runtime" -> "Reset all runtimes ..."`.  The only thing to be concerned about is that the outputs of your code will remain visible by default.  To keep your browsing data results private in this class:
*   Don't share a copy of this notebook that exposes your output data to others.  At the top of this notebook you can click `"Edit" -> "Clear all outputs"` to hide the results of your code.   
*   For any work you actually submit, switch out your data file for the  <a href="https://raw.githubusercontent.com/erickaakcire/wh_course_activity/main/web_historian_data_demo.json" download>sample `.json` file</a> that I provide.  

## Uploading your data
Run this to create the upload prompt and upload a Web Historian `.json` file, either your own or the sample provided.

In [12]:
#from google.colab import files # uncomment for Google Colab
#uploaded = files.upload() # uncomment for Google Colab
#uploaded = list( uploaded.values() ).pop().decode('utf-8') # uncomment for Google Colab
from ipywidgets import FileUpload
uploaded = FileUpload(multiple=False, accept='.json')
uploaded

FileUpload(value={}, accept='.json', description='Upload')

Once you've selected the file, run the code below to load it. You will then be ready to run the rest of the code in this notebook.

The output shows just the first and last few rows of your browser history, as exported by Web Historian.

In [13]:
# You don't have to understand this line.  
# Just know it takes your data file and turns it into a Pandas dataframe
import pandas as pd
df = pd.read_json(uploaded.data[0])
df

Unnamed: 0,id,url,urlId,protocol,domain,searchTerms,date,transType,refVisitId,title
0,176954,https://www.insidehighered.com/news/2022/06/07...,44087,https,insidehighered.com,,2022-06-07 11:46:11.476375040,auto_toplevel,176953,Simulations help students recognize mental dis...
1,176956,https://erickaakcire.github.io/emt-cv-web.pdf,43772,https,github.io,,2022-06-07 13:34:31.839195136,link,0,emt-cv-web
2,177094,https://eu-west-1.console.aws.amazon.com/athen...,27078,https,amazon.com,,2022-06-07 13:40:49.725507072,typed,0,Athena
3,177097,https://eu-west-1.signin.aws.amazon.com/oauth?...,44091,https,amazon.com,,2022-06-07 13:40:50.541009920,link,177096,Amazon Web Services Sign-In
4,177098,https://eu-west-1.signin.aws.amazon.com/oauth?...,44092,https,amazon.com,,2022-06-07 13:42:41.296814080,link,0,Amazon Web Services Sign-In
...,...,...,...,...,...,...,...,...,...,...
11653,212702,https://stackoverflow.com/questions/63215752/h...,55971,https,stackoverflow.com,,2022-09-05 00:38:03.568451840,link,0,python - How to use FileUpload widget in jupyt...
11654,212703,http://localhost:8888/notebooks/Module%204%20D...,55958,http,localhost:8888,,2022-09-05 00:41:24.483062016,link,0,Module 4 Demo - Web History - Jupyter Notebook
11655,212713,https://www.google.com/search?q=FutureWarning%...,55974,https,google.com,FutureWarning: The default value of regex will...,2022-09-05 00:54:11.958309888,generated,0,FutureWarning: The default value of regex will...
11656,212716,https://wrlc.org/,55976,https,wrlc.org,,2022-09-05 00:59:22.121107200,link,0,Welcome to the Washington Research Library Con...
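
The loading cell above reads the file through the widget's `data` attribute, which, as far as I know, exists only in ipywidgets 7.x; ipywidgets 8 exposes uploads through `value` instead. The sketch below is one hedged way to cope with both shapes. The helper name `first_upload_bytes`, the fake widget values, and the two sample records are all hypothetical and only mimic a few of the columns shown above.

```python
import io
import json
import pandas as pd

def first_upload_bytes(value):
    """Return the raw bytes of the first uploaded file.

    Assumes `value` is either a dict of {filename: {'content': bytes, ...}}
    (the ipywidgets 7.x shape) or a tuple of dict-like entries that each
    hold a 'content' buffer (the ipywidgets 8.x shape).
    """
    if isinstance(value, dict):
        entry = next(iter(value.values()))
    else:
        entry = value[0]
    return bytes(entry['content'])

# Hypothetical stand-ins for the widget's `value` after an upload
records = [
    {'id': 1, 'url': 'https://www.google.com/search?q=pandas', 'domain': 'google.com'},
    {'id': 2, 'url': 'https://wrlc.org/', 'domain': 'wrlc.org'},
]
raw = json.dumps(records).encode('utf-8')
fake_v7 = {'web_historian_data.json': {'metadata': {}, 'content': raw}}
fake_v8 = ({'name': 'web_historian_data.json', 'content': memoryview(raw)},)

for value in (fake_v7, fake_v8):
    # Decode to text and wrap in a file-like object before handing it to pandas
    text = first_upload_bytes(value).decode('utf-8')
    df_sample = pd.read_json(io.StringIO(text))
    print(df_sample.shape)
```

In the notebook, the equivalent call would be `first_upload_bytes(uploaded.value)`.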


### First, we will need to select just the search data and remove some duplicates.

In [16]:
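# Pull the value of the `q` URL parameter (the search query) out of each URL
df['any_query'] = df['url'].str.extract(r'[&\?]q=([^&\?]*)')
# Turn URL-encoded spaces (%20 and +) back into spaces; other escapes such as %25 are left as-is
df['any_query'] = df['any_query'].str.replace(r'%20', ' ', regex=True)
df['any_query'] = df['any_query'].str.replace(r'\+', ' ', regex=True)
# Keep only the rows that actually contain a query
df_query = df[df['any_query'].notnull()][['date','any_query','domain','url']]
# TODO: to dedup, group by day and domain and take the earliest date
df_query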

Unnamed: 0,date,any_query,domain,url
39,2022-06-07 14:14:26.823367168,presto limit decimal places,google.com,https://www.google.com/search?q=presto+limit+d...
40,2022-06-07 14:14:27.961601024,presto limit decimal places,google.com,https://www.google.com/search?q=presto+limit+d...
53,2022-06-07 14:53:51.270210816,jekyll-theme-cayman,google.com,https://www.google.com/search?q=jekyll-theme-c...
54,2022-06-07 14:53:52.080195072,jekyll-theme-cayman,google.com,https://www.google.com/search?q=jekyll-theme-c...
64,2022-06-07 16:15:32.042872832,ubuntu is using 100%25 disk,google.com,https://www.google.com/search?q=ubuntu+is+usin...
...,...,...,...,...
11633,2022-09-05 00:19:11.167385088,wh,github.com,https://github.com/?q=wh
11634,2022-09-05 00:19:13.278363136,wh_,github.com,https://github.com/?q=wh_
11651,2022-09-05 00:37:50.819473152,from ipywidgets import FileUpload,google.com,https://www.google.com/search?q=from+ipywidget...
11652,2022-09-05 00:37:51.453256960,from ipywidgets import FileUpload,google.com,https://www.google.com/search?q=from+ipywidget...
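
Notice that one query above still contains `100%25`: the cell only converts `%20` and `+`, so other percent-escapes survive. A possible alternative, sketched here with Python's standard `urllib.parse` module, decodes the whole `q` parameter at once. The helper name `extract_query` is hypothetical, the first example URL is made up for illustration, and unlike the regular expression above this version only looks at the query-string part of the URL.

```python
from urllib.parse import urlsplit, parse_qs

def extract_query(url):
    """Return the decoded value of the `q` parameter, or None if there is none."""
    params = parse_qs(urlsplit(url).query)
    values = params.get('q')
    return values[0] if values else None

examples = [
    'https://www.google.com/search?q=100%25+cotton+shirts&hl=en',  # made-up URL
    'https://github.com/?q=wh_',
    'https://wrlc.org/',
]
for url in examples:
    print(url, '->', extract_query(url))
```

To try this on your own data, something like `df['url'].map(extract_query)` would produce a column comparable to `any_query`.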


You will probably see a major search engine dominating these results, such as Google or Bing, but this query will also surface other sites where you search. To see which domains are included, run the next cell:

In [15]:
df_query['domain'].value_counts()

google.com                  1347
thenounproject.com            42
memegine.com                  22
github.com                    15
scholar.google.de              6
wrlc.org                       5
twitter.com                    4
upwork.com                     4
montgomeryschoolsmd.org        1
oup.com                        1
console.cloud.google.com       1
cornell.edu                    1
Name: domain, dtype: int64
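
These counts include near-duplicate rows, since a single search often appears twice about a second apart. The note in the earlier cell suggests deduplicating by day and domain and keeping the earliest date; a minimal sketch of that idea follows. It assumes the query text is also part of the grouping key, and it uses a small made-up DataFrame in place of `df_query`. The function name `dedup_searches` is hypothetical.

```python
import pandas as pd

# Made-up rows standing in for `df_query`
sample = pd.DataFrame({
    'date': pd.to_datetime([
        '2022-06-07 14:14:27', '2022-06-07 14:14:26',
        '2022-06-07 16:15:32', '2022-06-08 09:00:00',
    ]),
    'any_query': ['pandas groupby', 'pandas groupby', 'jupyter widgets', 'pandas groupby'],
    'domain': ['google.com'] * 4,
})

def dedup_searches(frame):
    """Keep the earliest row for each (day, domain, query) combination."""
    # normalize() sets the time to midnight, giving one value per calendar day
    out = frame.assign(day=frame['date'].dt.normalize()).sort_values('date')
    out = out.drop_duplicates(subset=['day', 'domain', 'any_query'], keep='first')
    return out.drop(columns='day').reset_index(drop=True)

deduped = dedup_searches(sample)
print(deduped)
print(deduped['domain'].value_counts())
```

Whether a repeated search on a later day should count as a new search is a judgment call; this sketch treats it as one.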

If you want to see the searches from a particular domain, change the name of the domain in the code below.

In [None]:
# Change 'github.com' to any domain listed above, then add your own exploration
df_query[df_query['domain'] == 'github.com']
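
As one possible piece of further exploration, the sketch below gathers searches from every domain containing a given string (so `google` would cover both `google.com` and `scholar.google.de`) and tallies the most repeated queries. The function name `searches_on` and the sample rows are hypothetical; in the notebook you would pass `df_query` instead of `sample`.

```python
import pandas as pd

# Made-up rows standing in for `df_query`
sample = pd.DataFrame({
    'any_query': ['pandas groupby', 'privacy law', 'pandas groupby', 'icon search'],
    'domain': ['google.com', 'scholar.google.de', 'google.com', 'thenounproject.com'],
})

def searches_on(frame, domain_part):
    """Return rows whose domain contains `domain_part` (case-insensitive, literal match)."""
    mask = frame['domain'].str.contains(domain_part, case=False, regex=False)
    return frame[mask]

google_searches = searches_on(sample, 'google')
# Most frequent queries first
print(google_searches['any_query'].value_counts().head(10))
```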

Now that you have a better sense of what kind of data a search engine company has about you, review what [Google has predicted about you and your interests](https://adssettings.google.com/). 

Here you have explored the data from only one web browser on one device, but Google has this information from every device where you log in. To see more of the data Google has about you, see [Google My Activity](https://myactivity.google.com/).

## Summary

* We have used Python to explore some of the data that is used by companies to target advertising.
* We have also explored some of the predictions that Google makes about us based on this type of data.
* We learned that the detailed behavioral data companies hold about us could be considered quite invasive of our privacy, and that once companies hold such data, the government could request it.