![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

# Readability of Privacy Policies

### Instructions
#### “Run” the cells to see the graphs
Click “Cell” and select “Run All”.<br> This will import the data and run all the code, so you can see this week's data visualization. Scroll to the top after you’ve run the cells.<br> 

![instructions](https://github.com/callysto/data-viz-of-the-week/blob/main/images/instructions.png?raw=true)

**You don’t need to do any coding to view the visualizations**.
The plots generated in this notebook are interactive. You can hover over and click on elements to see more information. 

Email contact@callysto.ca if you experience issues.

### About this Notebook

Callysto's Weekly Data Visualization is a learning resource that aims to develop data literacy skills. We provide Grades 5-12 teachers and students with a data visualization, like a graph, to interpret. This companion resource walks learners through how the data visualization is created and interpreted by a data scientist. 

The steps of the data analysis process are listed below and applied to each weekly topic.

1. Question - What are we trying to answer? 
2. Gather - Find the data source(s) you will need. 
3. Organize - Arrange the data, so that you can easily explore it. 
4. Explore - Examine the data to look for evidence to answer the question. This includes creating visualizations. 
5. Interpret - Describe what's happening in the data visualization. 
6. Communicate - Explain how the evidence answers the question. 

# Question

So, you've checked the "I read and agree to the terms and conditions" box for a new service you've just signed up for, without actually reading the terms. After all, they always seem very long and filled with jargon that does not make sense.

But just how difficult is it to read these terms of service? In this data visualization, we analyze the readability of various privacy policies to answer this question.

### Goal
Our goal is to investigate the privacy policies of some of the most popular social media websites to see just how difficult they are to read.

# Gather

### Code:
The code below will import the Python programming libraries we need to gather and organize the data to answer our question.

In [None]:
from bs4 import BeautifulSoup
import requests
import re 
import markdown
import textstat

import pandas as pd
import numpy as np

import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

### Data:

We will use the Princeton-Leuven Longitudinal Privacy Policy Dataset, which contains over 1 million policies that span more than 20 years. The dataset can be found in the following [repository](https://github.com/citp/privacy-policy-historical).

### Import the data

The code below reads the privacy policies from the following social media websites: TikTok, Twitter, Facebook, Instagram, YouTube, and Pinterest.

In [None]:
#getting the links for each of the policies
policies = ['https://github.com/citp/privacy-policy-historical/blob/master/t/ti/tik/tiktok.com.md',
            'https://github.com/citp/privacy-policy-historical/blob/master/t/tw/twi/twitter.com.br.md',
            'https://github.com/citp/privacy-policy-historical/blob/master/f/fa/fac/facebook.com.md',
            'https://github.com/citp/privacy-policy-historical/blob/master/i/in/ins/instagram.com.md',
            'https://github.com/citp/privacy-policy-historical/blob/master/y/yo/you/youtube.com.md',
            'https://github.com/citp/privacy-policy-historical/blob/master/p/pi/pin/pinterest.com.md']

In [None]:
# Put the privacy policies into a dataframe
def create_df(alist: list):
    result_dict = {}
    # Display names, in the same order as the URLs in the list
    title_list = ['TikTok', 'Twitter', 'Facebook', 'Instagram', 'YouTube', 'Pinterest']
    
    # Loop over the list that was passed in (not the global variable)
    for i, url in enumerate(alist): 
        response = requests.get(url)
        content = response.text

        # Keep only the part of the page that holds the policy text
        start_index = content.find('<article')
        end_index = content.find('</article>')
        body_content = content[start_index:end_index]

        # Convert the markdown to HTML, then strip the tags to get plain text
        html = markdown.markdown(body_content)
        soup = BeautifulSoup(html, 'html.parser')
        texts = soup.get_text()
        result_dict[title_list[i]] = texts

    df = pd.DataFrame(result_dict.items(), columns=["Social_Media", "Texts"])
    # Remove line breaks from the text
    df["Texts"] = df["Texts"].apply(lambda x: x.replace('\n', ''))
    return df

policy_df = create_df(policies)
policy_df
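The slicing step in `create_df` relies on `str.find`, which returns `-1` when the marker is not found, so a page without an `<article` tag would quietly produce the wrong slice. The sketch below shows one way to guard against that. The helper name `extract_between` is hypothetical and is not part of the notebook.

```python
def extract_between(content, start_marker, end_marker):
    """Return the text from start_marker up to (not including) end_marker.

    Hypothetical helper: returns None when either marker is missing,
    instead of slicing with the -1 that str.find reports.
    """
    start_index = content.find(start_marker)
    if start_index == -1:
        return None
    # Look for the end marker only after the start marker
    end_index = content.find(end_marker, start_index)
    if end_index == -1:
        return None
    return content[start_index:end_index]


# A tiny made-up page, used only for illustration
page = "<html><article>Our policy.</article></html>"
print(extract_between(page, "<article", "</article>"))           # <article>Our policy.
print(extract_between("<html></html>", "<article", "</article>"))  # None
```

Returning `None` makes a missing policy easy to spot before any readability scores are calculated.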

We can also search through the database for a specific company's privacy policy using the following code. The function `search_for_site` takes the name of the company you are looking for and gives a list of matching policy files in the dataset.

In [None]:
def search_for_site(keyword):
    
    base_url = "https://github.com/citp/privacy-policy-historical/tree/master/"
    # Policies are stored in nested folders named after the first one, two,
    # and three characters of the site name, e.g. t/tw/twi for "twitter"
    first_letter = keyword[0]
    first_two_letters = keyword[0:2]
    first_three_letters = keyword[0:3]
    
    search_url = base_url + first_letter + "/" + first_two_letters + "/" + first_three_letters
    r = requests.get(search_url)
    soup = BeautifulSoup(r.text, "html.parser")
    # These class names come from GitHub's page markup and may change over time
    mydivs = soup.find_all("a", {"class": "js-navigation-open Link--primary"})
    
    # Collect the file names, skipping links that have no title
    title_list = [item.get("title") for item in mydivs if item.get("title")]
    
    # Keep only the file names that contain the keyword
    result_list = [website for website in title_list if keyword in website]
            
    if not result_list:
        print("No website")
        
    # Always return a list, so the result can be displayed or reused
    return result_list
        
        
search_for_site("twitter")
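The policy links used earlier all follow the same folder pattern: the first one, two, and three characters of the domain name. As a rough sketch, assuming that pattern holds across the repository (it is inferred only from the URLs in this notebook), a link can be built directly from a domain name. `policy_dir` and `policy_url` are hypothetical helper names.

```python
BLOB_BASE = "https://github.com/citp/privacy-policy-historical/blob/master/"


def policy_dir(domain):
    """Build the nested folder path, e.g. 'tiktok.com' -> 't/ti/tik'."""
    return "/".join(domain[:n] for n in (1, 2, 3))


def policy_url(domain):
    """Build the link to a policy file (assumed layout: folder/domain.md)."""
    return BLOB_BASE + policy_dir(domain) + "/" + domain + ".md"


print(policy_dir("twitter"))
# t/tw/twi
print(policy_url("tiktok.com"))
# https://github.com/citp/privacy-policy-historical/blob/master/t/ti/tik/tiktok.com.md
```

The sketch does not check that the file actually exists, so a link built this way may still lead to a missing page.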

# Organize

The code below will calculate the readability score, grade score, as well as the reading time of each of the privacy policies that we are investigating and add them to our dataframe.

The grade score shows the number of years of education required to understand the text. For example, a score of 5 means that a fifth grader will generally understand the text.
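To see where a grade score comes from, here is a minimal sketch of the standard Flesch-Kincaid grade formula, written in terms of simple counts. The counts in the example are made up for illustration, and `textstat` counts words, sentences, and syllables in its own way, so its results for real text may differ slightly.

```python
def fk_grade_from_counts(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid grade level from word, sentence, and syllable counts."""
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Made-up counts for illustration: 100 words, 5 sentences, 150 syllables
print(round(fk_grade_from_counts(100, 5, 150), 2))  # 9.91
```

Longer sentences and longer words both push the grade score up.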

The readability score is another measure of grade level. It uses a lookup table of the 3000 most commonly used English words to decide which words in the text count as difficult.
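The sketch below shows the standard Dale-Chall formula in terms of simple counts. It assumes you already know how many words are "difficult" (not in the word list). The counts are made up for illustration, and `textstat` uses its own word list and counting rules, so treat this as an approximation of what the library does.

```python
def dale_chall_from_counts(total_words, total_sentences, difficult_words):
    """Dale-Chall readability score from word and sentence counts."""
    pct_difficult = 100 * difficult_words / total_words
    words_per_sentence = total_words / total_sentences
    score = 0.1579 * pct_difficult + 0.0496 * words_per_sentence
    # The formula adds a fixed adjustment when more than 5% of words are difficult
    if pct_difficult > 5:
        score += 3.6365
    return score


# Made-up counts for illustration: 100 words, 5 sentences, 10 difficult words
print(dale_chall_from_counts(100, 5, 10))
```

A text with more unfamiliar words or longer sentences gets a higher score, which means it is harder to read.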

The reading time gives an estimate of how long it will take to read the text.
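A reading time estimate can be as simple as counting characters and multiplying by a fixed time per character. The sketch below assumes that approach, uses the 14.69 milliseconds per character value from this notebook, and reports the result in seconds. Ignoring spaces is an assumption made here, and `textstat` may count characters differently.

```python
def estimate_reading_time(text, ms_per_char=14.69):
    """Estimate reading time in seconds from the number of characters."""
    # Count the characters in each word, ignoring the spaces between words
    n_chars = sum(len(word) for word in text.split())
    return n_chars * ms_per_char / 1000


print(estimate_reading_time("We collect your data."))
```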

For more information on the score values, you can refer to the [textstat library documentation](https://pypi.org/project/textstat/).

In [None]:
# Dale-Chall readability score
policy_df["Readability_score"] = policy_df["Texts"].apply(textstat.dale_chall_readability_score)
# Flesch-Kincaid grade level
policy_df["Grade_score"] = policy_df["Texts"].apply(textstat.flesch_kincaid_grade)
# Estimated reading time, using 14.69 ms per character
policy_df["Reading_time"] = policy_df["Texts"].apply(lambda x: textstat.reading_time(x, ms_per_char=14.69))
policy_df

# Explore

The code below will create a bar graph showing the different measures of readability for each of the social media websites. You can toggle between readability score, grade score, and reading time to compare how each of these companies performs.

In [None]:
# One bar trace per measure; only the readability score is visible at first
readability_trace = go.Bar(x=policy_df["Social_Media"], y=policy_df["Readability_score"], width=0.7,
                           name="Readability Score", marker=dict(color="#1EBC8C"), visible=True)
grade_trace = go.Bar(x=policy_df["Social_Media"], y=policy_df["Grade_score"], 
                     name="Grade Score", width=0.7, marker=dict(color="#4D8AF1"), visible=False)
time_trace = go.Bar(x=policy_df["Social_Media"], y=policy_df["Reading_time"], 
                    name="Reading Time", width=0.7, marker=dict(color="#F14D78"), visible=False)
traces = [readability_trace, grade_trace, time_trace]

fig = go.Figure(data=traces)

fig.update_layout(
    updatemenus=[
        dict(buttons=list([
            dict(label="Readability Score",
                 method="update",
                 args=[{"visible": [True, False, False]}, 
                       {"title":{"text":"Dale-Chall Readability Score", "x":0.5}}]),
            dict(label="Grade Score",
                 method="update",
                 args=[{"visible": [False, True, False]}, 
                       {"title":{"text":"Flesch-Kincaid Reading Grade", "x":0.5}}]),
            dict(label="Reading Time",
                 method="update",
                 args=[{"visible": [False, False, True]}, 
                       {"title":{"text":"Reading Time", "x":0.5}}]),
        ]),
             direction="down",
             pad={"r": 10, "t": 10},
             showactive=True,
             x=0,
             xanchor="left",
             y=1.15,
             yanchor="top"
            )
    ]  
)

fig.update_layout(
    title={'text': "Dale-Chall Readability Score",
           'x':0.5,
           'xanchor': 'center',
           'yanchor': 'top'},
    height=500, width=700)

fig.show()
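Each dropdown button in the figure above carries a list of `True`/`False` values that says which bar trace to show. As a small sketch, those button dictionaries could also be generated in a loop rather than typed out by hand. `make_buttons` is a hypothetical helper, not part of the notebook or of Plotly. It only builds plain dictionaries in the same shape as the ones passed to `updatemenus`.

```python
def make_buttons(labels, titles):
    """Build one dropdown button per trace, each showing only its own trace."""
    buttons = []
    for i, (label, title) in enumerate(zip(labels, titles)):
        # True only in position i, so only the i-th trace is visible
        visible = [j == i for j in range(len(labels))]
        buttons.append({
            "label": label,
            "method": "update",
            "args": [{"visible": visible}, {"title": {"text": title, "x": 0.5}}],
        })
    return buttons


buttons = make_buttons(
    ["Readability Score", "Grade Score", "Reading Time"],
    ["Dale-Chall Readability Score", "Flesch-Kincaid Reading Grade", "Reading Time"],
)
for button in buttons:
    print(button["label"], button["args"][0]["visible"])
```

Generating the lists this way keeps the buttons in step with the traces if another measure is added later.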

From the graph we can see that all of the privacy policies have fairly high Dale-Chall readability scores, from a low of 6.58 for Pinterest to a high of 7.82 for TikTok. A higher score means the text is harder to read.

We can also see from the graph above that TikTok's privacy policy takes the least time to read but still ranks among the highest for grade score and readability score. Meanwhile, Twitter has the longest reading time, yet its readability score and grade score are comparable to TikTok's.

We can look at the relationship between readability score, grade score, and reading time in more depth by investigating the correlations between these three variables in the code below.

In [None]:
def create_correlation_graph(df):
    variables = ["Readability_score", "Grade_score", "Reading_time"]
    colors = ["cadetblue", "darksalmon", "limegreen"]
    
    fig = make_subplots(rows=3, cols=3, horizontal_spacing=0.1)

    # Each row fixes the x variable; each column plots a different y variable
    for row_no, (x_var, color) in enumerate(zip(variables, colors), start=1):
        for col_no, y_var in enumerate(variables, start=1):
            # Fit a straight line (degree-1 polynomial) through the points
            slope, y_int = np.polyfit(df[x_var], df[y_var], 1)
            best_fit = slope * df[x_var] + y_int
            
            # Scatter plot of the data points
            fig.add_trace(go.Scatter(x=df[x_var], y=df[y_var], mode="markers",
                                     marker={"color": color}, 
                                     hovertemplate="%{x}, %{y}<extra></extra>"), 
                          row=row_no, col=col_no)
            # Line of best fit, drawn in the same color
            fig.add_trace(go.Scatter(x=df[x_var], y=best_fit, mode="lines",
                                     line={"color": color}, name="Line of Best Fit",
                                     hoverlabel=dict(namelength=20)), 
                          row=row_no, col=col_no)
            fig.update_xaxes(title_text=x_var, row=row_no, col=col_no)
            fig.update_yaxes(title_text=y_var, row=row_no, col=col_no)

    fig.update_layout(height=800, width=800, showlegend=False,
                      title={"text": "Correlation Comparison", "x": 0.5})
    fig.show()
        
create_correlation_graph(policy_df)
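The lines of best fit in the graphs above come from a least-squares fit. The sketch below shows that calculation by hand, along with the Pearson correlation coefficient, which puts a single number between -1 and 1 on how closely two variables move together. The numbers are toy values chosen for illustration and are not taken from the privacy policies. `best_fit_line` and `pearson_r` are hypothetical helper names.

```python
from math import sqrt


def best_fit_line(xs, ys):
    """Least-squares slope and intercept of the line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Toy values that lie exactly on the line y = 2x + 1
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
print(best_fit_line(xs, ys))  # (2.0, 1.0)
print(pearson_r(xs, ys))      # 1.0
```

With only six policies in this notebook, any correlation value should be read with caution, because a single point can change it a lot.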

# Interpret
Based on what we see from the graphs above, the most popular social media sites have quite high readability scores, ranging from 6.58 to 7.82. On the Dale-Chall scale, this corresponds to reading levels from about 7th or 8th grade up to 9th or 10th grade.
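To turn a Dale-Chall score into a grade level, you can look it up in the usual Dale-Chall score bands. The sketch below does this lookup for the lowest and highest scores mentioned above. `dale_chall_band` is a hypothetical helper name, and the band labels are paraphrased.

```python
# Upper limit of each band (not included) and the matching reading level
DALE_CHALL_BANDS = [
    (5.0, "4th grade or lower"),
    (6.0, "5th or 6th grade"),
    (7.0, "7th or 8th grade"),
    (8.0, "9th or 10th grade"),
    (9.0, "11th or 12th grade"),
]


def dale_chall_band(score):
    """Return the reading level that matches a Dale-Chall score."""
    for upper_limit, label in DALE_CHALL_BANDS:
        if score < upper_limit:
            return label
    return "college level"


for score in (6.58, 7.82):
    print(score, "->", dale_chall_band(score))
# 6.58 -> 7th or 8th grade
# 7.82 -> 9th or 10th grade
```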

The grade scores are also quite high, ranging from 10.8 to 14.6. This means that at least 10.8 years of education are needed to understand these privacy policies, while some require more than 14 years of education.

We also see that most of the privacy policies take quite a long time to read, but that the time it takes to read a privacy policy does not necessarily affect the readability score. That means that even short privacy policies are written in a way that makes them hard to understand.

# Taking a further look: web scraping

Want to analyze the readability of text you find online? You can use the following code to scrape a webpage you want to analyze. The example below scrapes Google's privacy policy.

In [None]:
# Scrape the text of a privacy policy from a web page

def webscrape_private_policy(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    # "nrAB0c" is the class used for content blocks on Google's policy page;
    # other websites will need a different class name
    mydivs = soup.find_all("div", {"class": "nrAB0c"})
    contents_list = []
    for item in mydivs:
        # Take the first paragraph and the first list in each block
        contents_list.append(item.find("p"))
        contents_list.append(item.find("ul"))

    cleaned_contents_list = []
    for content in contents_list:
        # Skip blocks where no paragraph or list was found
        if content is not None:
            # Remove the HTML tags, keeping only the text
            cleaned_contents_list.append(re.sub('<.*?>', '', str(content)))
            
    cleaned_text = ''.join(cleaned_contents_list)
    return cleaned_text

url = "https://policies.google.com/privacy?hl=en-US"
scraped_texts = webscrape_private_policy(url)
scraped_texts
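The tag-removal step uses a regular expression that deletes anything between `<` and `>`. The sketch below tries it on a small made-up snippet and also decodes HTML entities such as `&amp;` with the standard library's `html.unescape`, which the function above does not do. `strip_tags` is a hypothetical helper name.

```python
import html
import re


def strip_tags(markup):
    """Remove HTML tags and decode entities such as &amp;."""
    # The ? makes the match stop at the first >, so each tag is removed separately
    text = re.sub(r"<.*?>", "", markup)
    return html.unescape(text)


print(strip_tags("<p>We collect <b>data</b> &amp; more.</p>"))
# We collect data & more.
```

A regular expression like this works for simple pages, but a parser such as BeautifulSoup's `get_text()` is usually more reliable on messy HTML.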

# Communicate
Below are some writing prompts to help you reflect on the new information presented by the data. When you look at the evidence, think about what you perceive about the information. Is this perception based on what the evidence shows? If others were to view it, what perceptions might they have?

I used to think ____________________ but now I know ____________________.
I wish I knew more about ____________________.
This visualization reminds me of ____________________.
I really like ____________________.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)