# 2.  Data Cleansing


This is the ```second``` notebook in the N-part series of notebooks for DSS challenge 2.

- [Data Information](01_data_information)

- Data Cleaning 

- [Data Analysis](03_data_analysis.ipynb)


In this notebook, we will clean up the page content data. This is a continuation of the [Data Information]() notebook, where we had a first look at the data and parsed the Excel file to get the data in CSV format.


### Highlights
- Understanding the Information
- Clean column names
- Split complex columns into multiple columns
- Extract information from Page Column
- Generate different dataframes for different types of pages: weekly_brainpost

Here, we will do the data cleaning.

Let's start by loading the data.

In [34]:
import pandas as pd
import brainpostpy.brainpost_data_tidy as tidy
from urllib.parse import urlparse, parse_qs
import numpy as np

In [2]:
DATA_DIR = 'data'
content_data = 'content_data.csv'
page_view_data = 'page_view_data.csv' 

In [3]:
content_data_df = pd.read_csv(f'{DATA_DIR}/{content_data}')

In [4]:
content_data_df.head(2)

Unnamed: 0,Page,Source / Medium,Date Range,Pageviews,Unique Pageviews,Avg. Time on Page,Entrances,Bounce Rate,% Exit,Page Value
0,/weekly-brainpost/tag/sharp-wave+ripples,google / organic,"Oct 25, 2020 - Oct 31, 2020",1,1,00:00:00,1,1.0,1.0,0
1,/weekly-brainpost/tag/sharp-wave+ripples,google / organic,"Oct 18, 2020 - Oct 24, 2020",0,0,00:00:00,0,0.0,0.0,0


#### Understanding the data Columns

- Page
    - The Page column contains the path portion of the page URL. It holds plenty of information about the page; we just need to organize it.
- Source/Medium
    - The source and the medium of the traffic, for example ```google / organic```.
- Date range
    - The date range the row covers, given as a start date and an end date separated by a hyphen.
- Pageviews
    - Count of page views
- Unique Pageviews
    - Page views from unique users
- Average time spent on page
    - The average time spent on the page, given in the format HH:mm:ss
- Entrances
    - Number of entrances to the page
- Bounce Rate
    - Bounce rate of the page
- % Exit
    - Percentage of exit 
- Page Value
    - [Page value](https://support.google.com/analytics/answer/2695658?hl=en)
    
    

There is a lot of information here, but we need to tidy up the data before we can extract useful insights.

### Data Cleanup
- Let's start cleaning up. First, let's rename the columns.

In [5]:
content_data_df = tidy.tidy_column_names(content_data_df)
content_data_df.head(1)

Unnamed: 0,page,source_medium,date_range,pageviews,unique_pageviews,avg_time_on_page,entrances,bounce_rate,percent_exit,page_value
0,/weekly-brainpost/tag/sharp-wave+ripples,google / organic,"Oct 25, 2020 - Oct 31, 2020",1,1,00:00:00,1,1.0,1.0,0
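
The implementation of ```tidy.tidy_column_names``` is not shown in this notebook. As a minimal sketch of how such a renaming could work, the hypothetical helper ```tidy_column_name``` below converts the Google Analytics headers to snake_case; it is an assumption for illustration, not the package's actual code.

```python
import re

def tidy_column_name(name: str) -> str:
    """Hypothetical helper: convert a Google Analytics header to snake_case."""
    name = name.strip().lower()
    name = name.replace('%', 'percent')      # '% exit' -> 'percent exit'
    name = re.sub(r'[^a-z0-9]+', '_', name)  # any run of other characters -> '_'
    return name.strip('_')                   # drop leading/trailing underscores

raw_columns = ['Page', 'Source / Medium', 'Date Range', 'Avg. Time on Page', '% Exit']
clean_columns = [tidy_column_name(c) for c in raw_columns]
print(clean_columns)
# ['page', 'source_medium', 'date_range', 'avg_time_on_page', 'percent_exit']
```

With pandas, a function like this can be passed directly to ```DataFrame.rename(columns=...)```.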


- Let's keep the ```page``` column as it is for now. There is a lot to do with that column, so we will come back to it later.
- The second column is ```source_medium```. It is a good idea to split it into ```source``` and ```medium```.
- We can do the same for the ```date_range``` column. A date range is better represented as two columns, ```start_date``` and ```end_date```.

In [6]:
# source_medium -> source, medium
content_data_df[['source', 'medium']] = content_data_df.source_medium.str.split("/", expand=True)
content_data_df = content_data_df.drop(['source_medium'], axis=1)
# date_range -> start_date, end_date
content_data_df[['start_date', 'end_date']] = content_data_df.date_range.str.split("-", expand=True)
content_data_df = content_data_df.drop(['date_range'], axis=1)
# remove the spaces left around the split values
for col in ['source', 'medium', 'start_date', 'end_date']:
    content_data_df[col] = content_data_df[col].str.strip()
content_data_df.columns

Index(['page', 'pageviews', 'unique_pageviews', 'avg_time_on_page',
       'entrances', 'bounce_rate', 'percent_exit', 'page_value', 'source',
       'medium', 'start_date', 'end_date'],
      dtype='object')
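
As a self-contained illustration of this kind of split, the sketch below uses the two rows shown earlier. It splits on the separator together with its surrounding spaces and, as an optional extra step that the notebook itself does not perform, parses the dates into timestamps.

```python
import pandas as pd

# The two sample rows displayed at the top of this notebook.
df = pd.DataFrame({
    'source_medium': ['google / organic', 'google / organic'],
    'date_range': ['Oct 25, 2020 - Oct 31, 2020', 'Oct 18, 2020 - Oct 24, 2020'],
})

# Splitting on the separator including its spaces leaves no padding behind.
df[['source', 'medium']] = df['source_medium'].str.split(' / ', expand=True)
df[['start_date', 'end_date']] = df['date_range'].str.split(' - ', expand=True)

# Optional: turn the date strings into real timestamps for sorting and filtering.
for col in ['start_date', 'end_date']:
    df[col] = pd.to_datetime(df[col], format='%b %d, %Y')

df = df.drop(columns=['source_medium', 'date_range'])
print(df)
```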

- The time in the ```avg_time_on_page``` column is in hours:minutes:seconds format. Let's make it numeric by converting it to seconds.

In [7]:
content_data_df['avg_time_on_page'] = content_data_df['avg_time_on_page'].apply(tidy.time_hhmmss_to_sec)
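
The body of ```tidy.time_hhmmss_to_sec``` is not shown here. A conversion of this kind could look like the sketch below; ```hhmmss_to_seconds``` is a hypothetical stand-in that assumes every value has exactly three colon-separated integer fields.

```python
def hhmmss_to_seconds(value: str) -> int:
    """Hypothetical stand-in: convert an 'HH:MM:SS' string to seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(':'))
    return hours * 3600 + minutes * 60 + seconds

print(hhmmss_to_seconds('00:00:00'))  # 0
print(hhmmss_to_seconds('00:01:30'))  # 90
```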

Let's have a look at the ```page_value``` column.
[Page Value](https://support.google.com/analytics/answer/2695658?hl=en) is the average value for a page that a user visited before landing on the goal page or completing an Ecommerce transaction (or both).

In [8]:
content_data_df['page_value'].unique()

array([0])

The ```page_value``` column contains only zeros, so it is not worth keeping. Let's remove it.

In [9]:
content_data_df = content_data_df.drop(['page_value'], axis=1)
content_data_df.head(1)

Unnamed: 0,page,pageviews,unique_pageviews,avg_time_on_page,entrances,bounce_rate,percent_exit,source,medium,start_date,end_date
0,/weekly-brainpost/tag/sharp-wave+ripples,1,1,0,1,1.0,1.0,google,organic,"Oct 25, 2020","Oct 31, 2020"


###### Now let's get some useful information from the ```page``` column
- The ```page``` column holds the page URL. Because the URLs are well structured, they show clearly which category each page belongs to, and some of them contain tag names too. So let's clean it up.
- We are interested in paths and query strings. Let's check what kind of query strings we have.

In [10]:
page_vals = content_data_df['page'].values

Looking at the values in the ```page``` column, we can see that some pages start with ```translate_c```. These are pages visited through Google Translate. Let's first do two things:
1. Find the actual page URL inside the translate URL, and use that for further analysis.
2. Prepare data for translation statistics.

In [11]:
for index, row in content_data_df.iterrows():
    if '/translate_c' in row['page']:
        print(tidy.get_page_if_translated(row['page']))

/weekly-brainpost/2020/10/6/neuronal-computation-underlying-inference-in-the-brain?fbclid=IwAR25ei5RFRF3Bt7VmQu4It6HIKiyPbXAf3yQQIFKN7b_MwvxI5a
/weekly-brainpost/2020/10/6/neuronal-computation-underlying-inference-in-the-brain?fbclid=IwAR25ei5RFRF3Bt7VmQu4It6HIKiyPbXAf3yQQIFKN7b_MwvxI5a
/weekly-brainpost/2018/6/19/stress-hormones-sensitize-fear-circuits-in-the-brain
/weekly-brainpost/2018/6/19/stress-hormones-sensitize-fear-circuits-in-the-brain
/brainpost-life-hacks/2019/1/2/new-year-new-me-the-neuroscience-of-habit-formation
/brainpost-life-hacks/2019/1/2/new-year-new-me-the-neuroscience-of-habit-formation
/weekly-brainpost/2020/7/28/cortical-network-responses-and-visual-semantics-of-movie-fragments
/weekly-brainpost/2020/7/28/cortical-network-responses-and-visual-semantics-of-movie-fragments
/weekly-brainpost/2020/6/23/decoding-of-natural-sounds-in-congenitally-blind-individuals
/weekly-brainpost/2020/6/23/decoding-of-natural-sounds-in-congenitally-blind-individuals
/weekly-brainpos

We will replace all pages containing translate_c with the real URL later in the cleaning process. Before that, let's prepare a different dataframe for the translation information.

In [12]:
count_translations = 0
# use a list for the column names; a set has no defined order
translation_df = pd.DataFrame(columns=['language', 'page'])
pages = []
languages = []
for index, row in content_data_df.iterrows():
    if '/translate_c' in row['page']:
        count_translations += 1
        # the translate URL carries the language in `hl` and the original URL in `u`
        parsed_page = urlparse(row["page"])
        split_query = parsed_page.query.split('&')
        for split_pair in split_query:
            if split_pair.startswith('hl='):
                languages.append(split_pair[3:])
            elif split_pair.startswith('u='):
                # keep only the path part of the original URL
                pages.append(split_pair[2:].replace('https://www.brainpost.co/', '/'))
translation_df['page'] = pages
translation_df['language'] = languages
print(f"Total Number of times the translated pages were viewed: {count_translations}")

Total Number of times the translated pages were viewed: 12
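
The same extraction can also be sketched with ```parse_qs```, which keeps the language and the original URL of one row together. The function name ```parse_translated_page``` and the sample URL are assumptions for illustration; the sample only follows the ```hl``` and ```u``` keys used above and is not taken from the data.

```python
from urllib.parse import urlparse, parse_qs

def parse_translated_page(page: str):
    """Hypothetical helper: return (original_path, language) for a translate URL, else None."""
    parsed = urlparse(page)
    if not parsed.path.startswith('/translate_c'):
        return None
    query = parse_qs(parsed.query)
    # parse_qs maps each key to a list of values; take the first one if present.
    language = query.get('hl', [None])[0]
    target = query.get('u', [None])[0]
    original_path = urlparse(target).path if target else None
    return original_path, language

# Made-up sample URL in the assumed shape of a translated page view.
sample = ('/translate_c?depth=1&hl=fa&u=https://www.brainpost.co'
          '/weekly-brainpost/2018/6/19/stress-hormones-sensitize-fear-circuits-in-the-brain')
print(parse_translated_page(sample))
print(parse_translated_page('/faq'))  # None: not a translated page
```

Returning both values from one call avoids the two lists drifting out of step when a URL lacks one of the keys.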


Now that we have the translation dataframe, we will extract more information from its ```page``` column later. For now, let's see what other information the page query strings hold.

In [13]:
tidy.check_query_keys(page_vals)

/weekly-brainpost/tag/sharp-wave+ripple
/weekly-brainpost/tag/sharp-wave+ripple
/weekly-brainpost/tag/precentral+gyru
/weekly-brainpost/tag/precentral+gyru
/weekly-brainpost/tag/NMDA+blocker
/weekly-brainpost/tag/NMDA+blocker
/weekly-brainpost/tag/dorsolateral+prefrontal+corte
/weekly-brainpost/tag/dorsolateral+prefrontal+corte
/weekly-brainpost/2020/9/8/the-role-of-the-medial-prefrontal-in-exploring-new-options?fbclid
/weekly-brainpost/2020/9/8/the-role-of-the-medial-prefrontal-in-exploring-new-options?fbclid
/weekly-brainpost/2020/9/8/the-role-of-the-medial-prefrontal-in-exploring-new-option
/weekly-brainpost/2020/9/8/the-role-of-the-medial-prefrontal-in-exploring-new-option
/weekly-brainpost/2020/9/8/neuropeptides-and-astrocytes-regulate-adult-neural-stem-cell-activit
/weekly-brainpost/2020/9/8/neuropeptides-and-astrocytes-regulate-adult-neural-stem-cell-activit
/weekly-brainpost/2020/9/29/rem-sleep-is-necessary-for-experience-dependent-plasticity?fbclid
/weekly-brainpost/2020/9/29/

/search?q
/search?q
/search?q
/search?q
/search?q
/search?q
/search?q
/search?q
/search?q
/search?q
/search?q
/search?q
/searc
/searc
/searc
/searc
/searc
/searc
/searc
/searc
/searc
/searc
/occasional-contributor
/occasional-contributor
/fa
/fa
/fa
/fa
/fa
/fa
/contac
/contac
/contac
/contac
/checkout/donate?donatePageId
/checkout/donate?donatePageId
/checkout/donate?donatePageId
/checkout/donate?donatePageId
/checkout/donate?donatePageId
/checkout/donate?donatePageId
/brainpost-life-hacks/2019/5/7/why-we-make-decisions-together-tt9e
/brainpost-life-hacks/2019/5/7/why-we-make-decisions-together-tt9e
/brainpost-life-hacks/2019/3/21/can-we-alter-the-progression-of-huntingtons-diseas
/brainpost-life-hacks/2019/3/21/can-we-alter-the-progression-of-huntingtons-diseas
/brainpost-life-hacks/2019/10/22/understanding-the-anxious-brai
/brainpost-life-hacks/2019/10/22/understanding-the-anxious-brai
/brainpost-life-hacks/2019/10/22/understanding-the-anxious-brai
/brainpost-life-hacks/2019/10/22/u

/weekly-brainpost/2020/9/29/the-occipital-cortex-and-visual-memory-development?fbclid
/weekly-brainpost/2020/9/29/the-occipital-cortex-and-visual-memory-development?fbclid
/weekly-brainpost/2020/9/29/the-occipital-cortex-and-visual-memory-development?fbclid
/weekly-brainpost/2020/9/29/the-occipital-cortex-and-visual-memory-development?fbclid
/weekly-brainpost/2020/9/29/the-occipital-cortex-and-visual-memory-development?fbclid
/weekly-brainpost/2020/9/29/the-occipital-cortex-and-visual-memory-development?fbclid
/weekly-brainpost/2020/9/29/the-occipital-cortex-and-visual-memory-development?fbclid
/weekly-brainpost/2020/9/29/the-occipital-cortex-and-visual-memory-development?fbclid
/weekly-brainpost/2020/9/29/the-occipital-cortex-and-visual-memory-development?fbclid
/weekly-brainpost/2020/9/29/the-occipital-cortex-and-visual-memory-development?fbclid
/weekly-brainpost/2020/9/29/the-occipital-cortex-and-visual-memory-development?fbclid
/weekly-brainpost/2020/9/29/the-occipital-cortex-and-v

/weekly-brainpost/2020/5/19/how-the-heartbeat-influences-conscious-perceptio
/weekly-brainpost/2020/5/12/a-new-method-for-assessing-stroke-recover
/weekly-brainpost/2020/5/12/a-new-method-for-assessing-stroke-recover
/weekly-brainpost/2020/5/12/a-new-method-for-assessing-stroke-recover
/weekly-brainpost/2020/5/12/a-new-method-for-assessing-stroke-recover
/weekly-brainpost/2020/4/7/social-framing-effects-in-the-brai
/weekly-brainpost/2020/4/7/social-framing-effects-in-the-brai
/weekly-brainpost/2020/4/7/intranasal-vegfd-treatment-reduces-brain-damage-following-strok
/weekly-brainpost/2020/4/7/intranasal-vegfd-treatment-reduces-brain-damage-following-strok
/weekly-brainpost/2020/4/28/post-ingestion-cues-reinforce-food-seeking-behaviour
/weekly-brainpost/2020/4/28/post-ingestion-cues-reinforce-food-seeking-behaviour
/weekly-brainpost/2020/4/28/characterizing-the-neural-signature-of-preference
/weekly-brainpost/2020/4/28/characterizing-the-neural-signature-of-preference
/weekly-brainpost/2

/weekly-brainpost/2018/8/28/oxytocin-affects-social-sharing-and-brain-activity-in-wome
/weekly-brainpost/2018/8/21/the-liquid-phase-of-a-protein-aids-synaptic-vesicle-releas
/weekly-brainpost/2018/8/21/the-liquid-phase-of-a-protein-aids-synaptic-vesicle-releas
/weekly-brainpost/2018/8/21/caudate-nucleus-stimulation-induces-negative-repetitive-decision-makin
/weekly-brainpost/2018/8/21/caudate-nucleus-stimulation-induces-negative-repetitive-decision-makin
/weekly-brainpost/2018/8/14/the-brains-functional-connectivity-profile-in-bipolar-disorde
/weekly-brainpost/2018/8/14/the-brains-functional-connectivity-profile-in-bipolar-disorde
/weekly-brainpost/2018/7/31/structure-and-function-of-presynaptic-inputs-varies-by-distance-from-the-postsynaptic-neuron-cell-bod
/weekly-brainpost/2018/7/31/structure-and-function-of-presynaptic-inputs-varies-by-distance-from-the-postsynaptic-neuron-cell-bod
/weekly-brainpost/2018/7/31/structure-and-function-of-presynaptic-inputs-varies-by-distance-from-the-

/weekly-brainpost/2020/5/19/how-the-heartbeat-influences-conscious-perceptio
/weekly-brainpost/2020/5/19/how-the-heartbeat-influences-conscious-perceptio
/weekly-brainpost/2020/5/12/brain-synchronization-during-inter-group-hostilit
/weekly-brainpost/2020/5/12/brain-synchronization-during-inter-group-hostilit
/weekly-brainpost/2020/4/7/the-effects-of-exercise-on-cognition?fbclid
/weekly-brainpost/2020/4/7/the-effects-of-exercise-on-cognition?fbclid
/weekly-brainpost/2020/4/7/intranasal-vegfd-treatment-reduces-brain-damage-following-stroke?rq
/weekly-brainpost/2020/4/7/intranasal-vegfd-treatment-reduces-brain-damage-following-stroke?rq
/weekly-brainpost/2020/4/28/post-ingestion-cues-reinforce-food-seeking-behaviour
/weekly-brainpost/2020/4/28/post-ingestion-cues-reinforce-food-seeking-behaviour
/weekly-brainpost/2020/4/28/post-ingestion-cues-reinforce-food-seeking-behaviour
/weekly-brainpost/2020/4/28/post-ingestion-cues-reinforce-food-seeking-behaviour
/weekly-brainpost/2020/4/28/a-nove

/weekly-brainpos
/weekly-brainpos
/weekly-brainpos
/weekly-brainpos
/weekly-brainpos
/weekly-brainpos
/weekly-brainpos
/weekly-brainpos
/weekly-brainpos
/weekly-brainpos
/weekly-brainpos
/weekly-brainpos
/weekly-brainpos
/weekly-brainpos
/weekly-brainpos
/weekly-brainpos
/weekly-brainpos
/weekly-brainpos
/weekly-brainpos
/weekly-brainpos
/weekly-brainpos
/weekly-brainpos
/search?q
/search?q
/search?q
/search?q
/search?q
/search?q
/search?q
/search?q
/search?q
/search?q
/search?q
/search?q
/search?q
/search?q
/search?q
/search?q
/search?q
/search?q
/searc
/searc
/searc
/searc
/searc
/searc
/occasional-contributor
/occasional-contributor
/occasional-contributor
/occasional-contributor
/occasional-contributor
/occasional-contributor
/induced-pluripotent-stem-cell-technolog
/induced-pluripotent-stem-cell-technolog
/faq
/faq
/fa
/fa
/fa
/fa
/fa
/fa
/contac
/contac
/contac
/contac
/checkout/donate?donatePageId
/checkout/donate?donatePageId
/checkout/donate?donatePageId
/checkout/donate?donat

It is not clear what all of these query string keys mean, but some of them look interesting and useful:

- fbclid: traffic from a link shared on Facebook.
- rq: the page was opened from a search result.
- month:
- offset: a pagination link was clicked.
- q: a search on the site.
- depth:
- back: back link from Google.
- s: 
- platform:
- ss_source:
- amp:
- sqsscreenshot:
- donatePageId:
- url:

From all these query strings we can obtain important information about the traffic.
Let's extend our dataframe with a few extra columns, in case they come in handy later.

In [14]:
content_data_df = tidy.extend_data_with_info_from_page(content_data_df)
content_data_df.head(2)

Unnamed: 0,page,pageviews,unique_pageviews,avg_time_on_page,entrances,bounce_rate,percent_exit,source,medium,start_date,end_date,path,from_facebook,google_keyword,from_google,search_keyword,sqsscreenshot,platform
0,/weekly-brainpost/tag/sharp-wave+ripples,1,1,0,1,1.0,1.0,google,organic,"Oct 25, 2020","Oct 31, 2020",/weekly-brainpost/tag/sharp-wave+ripples,,,,,,
1,/weekly-brainpost/tag/sharp-wave+ripples,0,0,0,0,0.0,0.0,google,organic,"Oct 18, 2020","Oct 24, 2020",/weekly-brainpost/tag/sharp-wave+ripples,,,,,,
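
How ```tidy.extend_data_with_info_from_page``` derives these columns is not shown. The sketch below illustrates one way to count query keys and derive columns such as ```path```, ```from_facebook``` and ```search_keyword``` with ```urllib.parse```. The query values ```memory``` and ```abc123``` are made up, and the mapping from keys to columns is an assumption.

```python
from collections import Counter
from urllib.parse import urlparse, parse_qs

# Paths follow the data above; the query values are made up for illustration.
sample_pages = [
    '/weekly-brainpost/tag/sharp-wave+ripples',
    '/search?q=memory',
    '/weekly-brainpost/2020/4/7/the-effects-of-exercise-on-cognition?fbclid=abc123',
]

key_counts = Counter()
rows = []
for page in sample_pages:
    parsed = urlparse(page)
    query = parse_qs(parsed.query)      # {} when there is no query string
    key_counts.update(query.keys())     # how often each key occurs
    rows.append({
        'path': parsed.path,
        'from_facebook': True if 'fbclid' in query else None,
        'search_keyword': query.get('q', [None])[0],
    })

print(key_counts)
for row in rows:
    print(row)
```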


Let's find out what other information we can get from the ```page``` column by checking the structure of ```path```.

In [15]:
content_data_df['path'].unique()

array(['/weekly-brainpost/tag/sharp-wave+ripples',
       '/weekly-brainpost/tag/precentral+gyrus',
       '/weekly-brainpost/tag/NMDA+blockers',
       '/weekly-brainpost/tag/dorsolateral+prefrontal+cortex',
       '/weekly-brainpost/2020/9/8/the-role-of-the-medial-prefrontal-in-exploring-new-options',
       '/weekly-brainpost/2020/9/8/neuropeptides-and-astrocytes-regulate-adult-neural-stem-cell-activity',
       '/weekly-brainpost/2020/9/29/rem-sleep-is-necessary-for-experience-dependent-plasticity',
       '/weekly-brainpost/2020/9/29/prediction-of-the-future-can-block-memory-formation',
       '/weekly-brainpost/2020/9/22/the-role-of-group-identity-in-social-influence',
       '/weekly-brainpost/2020/9/22/a-brain-rhythm-underlying-dissociation',
       '/weekly-brainpost/2020/9/15/tackling-covid-19-the-behavioural-consequences-of-face-mask-policies',
       '/weekly-brainpost/2020/9/15/prediction-errors-bias-time-perception',
       '/weekly-brainpost/2020/9/15/discovery-of-a-new-

Interestingly, the ```path``` values contain the titles and publication dates of the ```weekly_brainposts```. Let's collect all weekly_brainposts with their information in a separate dataframe.

#### Extracting information from weekly_brainposts

We are interested in weekly_brainposts. Let's have a look at what their data looks like.

In [16]:
for index, row in content_data_df.iterrows():
    if '/weekly-brainpost' in row['path']:
        print(row['path'])

/weekly-brainpost/tag/sharp-wave+ripples
/weekly-brainpost/tag/sharp-wave+ripples
/weekly-brainpost/tag/precentral+gyrus
/weekly-brainpost/tag/precentral+gyrus
/weekly-brainpost/tag/NMDA+blockers
/weekly-brainpost/tag/NMDA+blockers
/weekly-brainpost/tag/dorsolateral+prefrontal+cortex
/weekly-brainpost/tag/dorsolateral+prefrontal+cortex
/weekly-brainpost/2020/9/8/the-role-of-the-medial-prefrontal-in-exploring-new-options
/weekly-brainpost/2020/9/8/the-role-of-the-medial-prefrontal-in-exploring-new-options
/weekly-brainpost/2020/9/8/the-role-of-the-medial-prefrontal-in-exploring-new-options
/weekly-brainpost/2020/9/8/the-role-of-the-medial-prefrontal-in-exploring-new-options
/weekly-brainpost/2020/9/8/neuropeptides-and-astrocytes-regulate-adult-neural-stem-cell-activity
/weekly-brainpost/2020/9/8/neuropeptides-and-astrocytes-regulate-adult-neural-stem-cell-activity
/weekly-brainpost/2020/9/29/rem-sleep-is-necessary-for-experience-dependent-plasticity
/weekly-brainpost/2020/9/29/rem-sleep

/weekly-brainpost/2019/12/3/exploring-the-neurobiology-of-consciousness-using-the-psychedelic-dmt
/weekly-brainpost/2019/12/3/exercise-improves-brain-tissue-oxygenation-in-a-mouse-model-of-alzheimers-disease
/weekly-brainpost/2019/12/3/exercise-improves-brain-tissue-oxygenation-in-a-mouse-model-of-alzheimers-disease
/weekly-brainpost/2019/12/24/sleep-disturbance-and-migraine-onset
/weekly-brainpost/2019/12/24/sleep-disturbance-and-migraine-onset
/weekly-brainpost/2019/12/17/cognitive-deficits-in-temporal-lobe-epilepsy-and-alzheimer-like-pathologies
/weekly-brainpost/2019/12/17/cognitive-deficits-in-temporal-lobe-epilepsy-and-alzheimer-like-pathologies
/weekly-brainpost/2019/12/17/cognitive-deficits-in-temporal-lobe-epilepsy-and-alzheimer-like-pathologies
/weekly-brainpost/2019/12/17/cognitive-deficits-in-temporal-lobe-epilepsy-and-alzheimer-like-pathologies
/weekly-brainpost/2019/12/10/the-role-of-alpha-synchrony-in-spatial-attention-during-neurofeedback-training
/weekly-brainpost/2019

/weekly-brainpost/tag/dorsolateral+prefrontal+cortex
/weekly-brainpost/tag/burst+firing
/weekly-brainpost/tag/burst+firing
/weekly-brainpost/2020/9/8/the-role-of-the-medial-prefrontal-in-exploring-new-options
/weekly-brainpost/2020/9/8/the-role-of-the-medial-prefrontal-in-exploring-new-options
/weekly-brainpost/2020/9/8/neuropeptides-and-astrocytes-regulate-adult-neural-stem-cell-activity
/weekly-brainpost/2020/9/8/neuropeptides-and-astrocytes-regulate-adult-neural-stem-cell-activity
/weekly-brainpost/2020/9/8/neuropeptides-and-astrocytes-regulate-adult-neural-stem-cell-activity
/weekly-brainpost/2020/9/8/neuropeptides-and-astrocytes-regulate-adult-neural-stem-cell-activity
/weekly-brainpost/2020/9/8/neuropeptides-and-astrocytes-regulate-adult-neural-stem-cell-activity
/weekly-brainpost/2020/9/8/neuropeptides-and-astrocytes-regulate-adult-neural-stem-cell-activity
/weekly-brainpost/2020/9/8/neuropeptides-and-astrocytes-regulate-adult-neural-stem-cell-activity
/weekly-brainpost/2020/9/8

/weekly-brainpost/2019/12/17/distorting-mental-maps-using-virtual-reality
/weekly-brainpost/2019/12/17/distorting-mental-maps-using-virtual-reality
/weekly-brainpost/2019/12/17/cognitive-deficits-in-temporal-lobe-epilepsy-and-alzheimer-like-pathologies
/weekly-brainpost/2019/12/17/cognitive-deficits-in-temporal-lobe-epilepsy-and-alzheimer-like-pathologies
/weekly-brainpost/2019/12/10/the-role-of-alpha-synchrony-in-spatial-attention-during-neurofeedback-training
/weekly-brainpost/2019/12/10/the-role-of-alpha-synchrony-in-spatial-attention-during-neurofeedback-training
/weekly-brainpost/2019/12/10/overlapping-emotion-gradients-in-the-human-temporo-parietal-cortex
/weekly-brainpost/2019/12/10/overlapping-emotion-gradients-in-the-human-temporo-parietal-cortex
/weekly-brainpost/2019/12/10/brain-cell-type-specific-enhancer-promoter-interactions-and-disease-risk
/weekly-brainpost/2019/12/10/brain-cell-type-specific-enhancer-promoter-interactions-and-disease-risk
/weekly-brainpost/2019/11/5/st

/weekly-brainpost/2020/9/15/prediction-errors-bias-time-perception
/weekly-brainpost/2020/9/15/prediction-errors-bias-time-perception
/weekly-brainpost/2020/9/15/prediction-errors-bias-time-perception
/weekly-brainpost/2020/9/15/prediction-errors-bias-time-perception
/weekly-brainpost/2020/9/15/prediction-errors-bias-time-perception
/weekly-brainpost/2020/9/15/prediction-errors-bias-time-perception
/weekly-brainpost/2020/9/15/prediction-errors-bias-time-perception
/weekly-brainpost/2020/9/15/prediction-errors-bias-time-perception
/weekly-brainpost/2020/9/15/discovery-of-a-new-taste-receptor
/weekly-brainpost/2020/9/15/discovery-of-a-new-taste-receptor
/weekly-brainpost/2020/9/15/discovery-of-a-new-taste-receptor
/weekly-brainpost/2020/9/15/discovery-of-a-new-taste-receptor
/weekly-brainpost/2020/9/15/discovery-of-a-new-taste-receptor
/weekly-brainpost/2020/9/15/discovery-of-a-new-taste-receptor
/weekly-brainpost/2020/9/15/discovery-of-a-new-taste-receptor
/weekly-brainpost/2020/9/15/di

/weekly-brainpost/2020/9/8/the-role-of-the-medial-prefrontal-in-exploring-new-options
/weekly-brainpost/2020/9/8/the-role-of-the-medial-prefrontal-in-exploring-new-options
/weekly-brainpost/2020/9/8/the-role-of-the-medial-prefrontal-in-exploring-new-options
/weekly-brainpost/2020/9/8/the-role-of-the-medial-prefrontal-in-exploring-new-options
/weekly-brainpost/2020/9/8/the-role-of-the-medial-prefrontal-in-exploring-new-options
/weekly-brainpost/2020/9/8/the-role-of-the-medial-prefrontal-in-exploring-new-options
/weekly-brainpost/2020/9/8/the-role-of-the-medial-prefrontal-in-exploring-new-options
/weekly-brainpost/2020/9/8/the-role-of-the-medial-prefrontal-in-exploring-new-options
/weekly-brainpost/2020/9/8/the-role-of-the-medial-prefrontal-in-exploring-new-options
/weekly-brainpost/2020/9/8/the-role-of-the-medial-prefrontal-in-exploring-new-options
/weekly-brainpost/2020/9/8/the-role-of-the-medial-prefrontal-in-exploring-new-options
/weekly-brainpost/2020/9/8/the-role-of-the-medial-pref

/weekly-brainpost/2018/7/17/the-effects-of-maternal-cortisol-on-the-amygdala-and-internalizing-behaviours
/weekly-brainpost/2018/7/17/the-effects-of-maternal-cortisol-on-the-amygdala-and-internalizing-behaviours
/weekly-brainpost/2018/7/17/the-effects-of-maternal-cortisol-on-the-amygdala-and-internalizing-behaviours
/weekly-brainpost/2018/7/10/deep-brain-stimulation-of-the-thalamus-as-a-treatment-for-epilepsy
/weekly-brainpost/2018/7/10/deep-brain-stimulation-of-the-thalamus-as-a-treatment-for-epilepsy
/weekly-brainpost/2018/6/5/one-salience-network-two-functions
/weekly-brainpost/2018/6/5/one-salience-network-two-functions
/weekly-brainpost/2018/6/26/dialectical-behaviour-therapy-is-effective-for-adolescents-at-risk-of-suicide
/weekly-brainpost/2018/6/26/dialectical-behaviour-therapy-is-effective-for-adolescents-at-risk-of-suicide
/weekly-brainpost/2018/6/26/a-new-cell-type-in-the-hippocampus-contributes-to-sharp-waves-involved-in-memory
/weekly-brainpost/2018/6/26/a-new-cell-type-in-

/weekly-brainpost/2020/4/28/post-ingestion-cues-reinforce-food-seeking-behaviours
/weekly-brainpost/2020/4/28/post-ingestion-cues-reinforce-food-seeking-behaviours
/weekly-brainpost/2020/4/28/post-ingestion-cues-reinforce-food-seeking-behaviours
/weekly-brainpost/2020/4/28/post-ingestion-cues-reinforce-food-seeking-behaviours
/weekly-brainpost/2020/4/28/characterizing-the-neural-signature-of-preferences
/weekly-brainpost/2020/4/28/characterizing-the-neural-signature-of-preferences
/weekly-brainpost/2020/4/28/a-novel-pathway-underlying-hippocampal-neocortical-interactions
/weekly-brainpost/2020/4/28/a-novel-pathway-underlying-hippocampal-neocortical-interactions
/weekly-brainpost/2020/4/21/stopping-tau-in-its-tracks-investigating-the-spread-of-the-tau-protein-in-the-brain
/weekly-brainpost/2020/4/21/stopping-tau-in-its-tracks-investigating-the-spread-of-the-tau-protein-in-the-brain
/weekly-brainpost/2020/4/21/memories-that-share-a-common-structure-are-linked-together
/weekly-brainpost/2

/weekly-brainpost/2020/4/28/a-novel-pathway-underlying-hippocampal-neocortical-interactions
/weekly-brainpost/2020/4/28/a-novel-pathway-underlying-hippocampal-neocortical-interactions
/weekly-brainpost/2020/4/28/a-novel-pathway-underlying-hippocampal-neocortical-interactions
/weekly-brainpost/2020/4/28/a-novel-pathway-underlying-hippocampal-neocortical-interactions
/weekly-brainpost/2020/4/28/a-novel-pathway-underlying-hippocampal-neocortical-interactions
/weekly-brainpost/2020/4/28/a-novel-pathway-underlying-hippocampal-neocortical-interactions
/weekly-brainpost/2020/4/21/stopping-tau-in-its-tracks-investigating-the-spread-of-the-tau-protein-in-the-brain
/weekly-brainpost/2020/4/21/stopping-tau-in-its-tracks-investigating-the-spread-of-the-tau-protein-in-the-brain
/weekly-brainpost/2020/4/21/stopping-tau-in-its-tracks-investigating-the-spread-of-the-tau-protein-in-the-brain
/weekly-brainpost/2020/4/21/stopping-tau-in-its-tracks-investigating-the-spread-of-the-tau-protein-in-the-brain


/weekly-brainpost/2020/7/20/using-psychophysics-and-signal-detection-theory-to-improve-eyewitness-testimony
/weekly-brainpost/2020/7/20/microglia-ingest-myelin-sheaths-during-development
/weekly-brainpost/2020/7/20/microglia-ingest-myelin-sheaths-during-development
/weekly-brainpost/2020/7/20/microglia-ingest-myelin-sheaths-during-development
/weekly-brainpost/2020/7/20/microglia-ingest-myelin-sheaths-during-development
/weekly-brainpost/2020/7/20/microglia-ingest-myelin-sheaths-during-development
/weekly-brainpost/2020/7/20/microglia-ingest-myelin-sheaths-during-development
/weekly-brainpost/2020/7/20/microglia-ingest-myelin-sheaths-during-development
/weekly-brainpost/2020/7/20/microglia-ingest-myelin-sheaths-during-development
/weekly-brainpost/2020/7/20/microglia-ingest-myelin-sheaths-during-development
/weekly-brainpost/2020/7/20/microglia-ingest-myelin-sheaths-during-development
/weekly-brainpost/2020/7/20/microglia-ingest-myelin-sheaths-during-development
/weekly-brainpost/2020/

/weekly-brainpost/tag/ketamine
/weekly-brainpost/tag/immune+training
/weekly-brainpost/tag/immune+training
/weekly-brainpost/tag/dorsomedial+prefrontal+cortex
/weekly-brainpost/tag/dorsomedial+prefrontal+cortex
/weekly-brainpost/tag/brain+activation
/weekly-brainpost/tag/brain+activation
/weekly-brainpost/2020/8/4/unraveling-the-role-of-rem-and-non-rem-sleep-in-visual-learning-and-brain-plasticity
/weekly-brainpost/2020/8/4/unraveling-the-role-of-rem-and-non-rem-sleep-in-visual-learning-and-brain-plasticity
/weekly-brainpost/2020/8/4/unraveling-the-role-of-rem-and-non-rem-sleep-in-visual-learning-and-brain-plasticity
/weekly-brainpost/2020/8/4/unraveling-the-role-of-rem-and-non-rem-sleep-in-visual-learning-and-brain-plasticity
/weekly-brainpost/2020/8/4/unraveling-the-role-of-rem-and-non-rem-sleep-in-visual-learning-and-brain-plasticity
/weekly-brainpost/2020/8/4/unraveling-the-role-of-rem-and-non-rem-sleep-in-visual-learning-and-brain-plasticity
/weekly-brainpost/2020/8/4/unraveling-t

/weekly-brainpost/2018/3/13/an-unconscious-intervention-for-fear
/weekly-brainpost/2018/3/13/an-unconscious-intervention-for-fear
/weekly-brainpost/2018/2/5/the-role-of-white-matter-connections-in-adolescent-mental-health-and-cognition
/weekly-brainpost/2018/2/5/the-role-of-white-matter-connections-in-adolescent-mental-health-and-cognition
/weekly-brainpost/2018/2/27/7vdnew0lwomgpkhshp412xl5vm2z7f
/weekly-brainpost/2018/2/27/7vdnew0lwomgpkhshp412xl5vm2z7f
/weekly-brainpost/2018/2/26/0q9qv2xnfvchnl47qjdy2un02l0h98
/weekly-brainpost/2018/2/26/0q9qv2xnfvchnl47qjdy2un02l0h98
/weekly-brainpost/2018/2/26/0q9qv2xnfvchnl47qjdy2un02l0h98
/weekly-brainpost/2018/2/26/0q9qv2xnfvchnl47qjdy2un02l0h98
/weekly-brainpost/2018/2/19/resting-brain-activity-predicts-who-responds-to-cognitive-behavioral-therapy-for-ocd
/weekly-brainpost/2018/2/19/resting-brain-activity-predicts-who-responds-to-cognitive-behavioral-therapy-for-ocd
/weekly-brainpost/2018/2/19/ketamine-blocks-burst-firing-to-provide-depression

It looks like most of the rows are from weekly_brainpost.

We have some important information in ```path```.
##### Using the information in the ```path``` column, let's tidy up the weekly_brainposts data
1. Check whether there are weekly_brainpost URLs with no further path, for example http://brainpost.co/weekly_brainpost/
2. Check how the paths are formed, for example tags, dates, etc.
3. Create new columns: tag, page_title, pub_date, and section=weekly_brainpost

In [17]:
content_data_df = tidy.tidy_weekly_brainpost(content_data_df)

In [18]:
content_data_df.head(2)

Unnamed: 0,page,pageviews,unique_pageviews,avg_time_on_page,entrances,bounce_rate,percent_exit,source,medium,start_date,...,from_facebook,google_keyword,from_google,search_keyword,sqsscreenshot,platform,section,tag,pub_date,page_title
0,/weekly-brainpost/tag/sharp-wave+ripples,1,1,0,1,1.0,1.0,google,organic,"Oct 25, 2020",...,,,,,,,weekly-brainpost,sharp-wave ripples,,
1,/weekly-brainpost/tag/sharp-wave+ripples,0,0,0,0,0.0,0.0,google,organic,"Oct 18, 2020",...,,,,,,,weekly-brainpost,sharp-wave ripples,,
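
The two path shapes seen above (```/<section>/tag/<tag>``` and ```/<section>/<year>/<month>/<day>/<slug>```) can be parsed with regular expressions. The sketch below is an assumption about how the new columns could be filled, not the code of ```tidy.tidy_weekly_brainpost```; in particular, deriving ```page_title``` by replacing hyphens in the slug with spaces is a guess.

```python
import re

# Hypothetical patterns for the two path shapes observed in the data.
TAG_RE = re.compile(r'^/(?P<section>[^/]+)/tag/(?P<tag>[^/]+)/?$')
POST_RE = re.compile(
    r'^/(?P<section>[^/]+)/(?P<pub_date>\d{4}/\d{1,2}/\d{1,2})/(?P<slug>[^/]+)/?$'
)

def parse_path(path: str) -> dict:
    """Return section, tag, pub_date and page_title for a path (None where unknown)."""
    info = {'section': None, 'tag': None, 'pub_date': None, 'page_title': None}
    match = TAG_RE.match(path)
    if match:
        info['section'] = match['section']
        info['tag'] = match['tag'].replace('+', ' ')  # 'sharp-wave+ripples' -> 'sharp-wave ripples'
        return info
    match = POST_RE.match(path)
    if match:
        info['section'] = match['section']
        info['pub_date'] = match['pub_date']
        info['page_title'] = match['slug'].replace('-', ' ')  # assumed title format
    return info

print(parse_path('/weekly-brainpost/tag/sharp-wave+ripples'))
print(parse_path('/weekly-brainpost/2018/6/5/one-salience-network-two-functions'))
```

Paths that match neither pattern, such as ```/faq```, come back with every field set to ```None``` and can be handled in a later step.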


Apart from weekly brainposts, let's see what else we have in ```path```.

The other big sections are ```brainpost-life-hacks``` and ```blog```. Let's extract information from them in the same way.

In [19]:
content_data_df = tidy.tidy_brainpost_life_hacks(content_data_df)
content_data_df = tidy.tidy_blog(content_data_df)

In [20]:
# check the remaining paths for other section names

In [21]:
paths = []
for index, row in content_data_df.iterrows():
    if '/weekly-brainpost' not in row['path'] and '/brainpost-life-hacks' not in row['path'] :
        paths.append(row['path'])
print(list(set(paths)))

['/induced-pluripotent-stem-cell-technology', '/occasional-contributors', '/cache.aspx', '/home', '/brainpost-workflow-infographic', '/checkout/donate', '/blog/2019/1/28/scientists-have-got-the-recipe-for-growing-miniature-human-brains-just-right', '/about-brainpost/', '/faq/', '/faq', '/kayla-simanek', '/apis/site/proxy', '/contact', '/blog', '/archives', '/archives/', '/search', '/about-brainpost']


Now, let's fill the section column for the remaining sections that have no additional information in the path column.

In [22]:
remaining_section_names = ['home', 'contact', 'archives',
                           'faq', 'occasional-contributors',
                           'about-brainpost',
                           'brainpost-workflow-infographic', 'search', 'checkout'
                          ]
content_data_df = tidy.set_section_names(content_data_df, remaining_section_names)
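
One plausible way to assign these sections is to compare the first path segment with the list of names. The helper ```section_from_path``` below is a hypothetical sketch under that assumption; the matching rule used by ```tidy.set_section_names``` is not shown in this notebook.

```python
def section_from_path(path: str, section_names):
    """Hypothetical helper: return the first path segment if it is a known section."""
    # First non-empty segment, e.g. '/faq/' -> 'faq', '/checkout/donate' -> 'checkout'.
    segments = [segment for segment in path.split('/') if segment]
    first = segments[0] if segments else None
    return first if first in section_names else None

names = {'home', 'contact', 'archives', 'faq', 'occasional-contributors',
         'about-brainpost', 'brainpost-workflow-infographic', 'search', 'checkout'}

print(section_from_path('/faq/', names))             # faq
print(section_from_path('/checkout/donate', names))  # checkout
print(section_from_path('/kayla-simanek', names))    # None
```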

Now all pages should have a section. Let's check.

In [31]:
nosection = content_data_df['section'].isnull() 
content_data_df[nosection][['page','section']]

Unnamed: 0,page,section
2708,/kayla-simanek,
2709,/kayla-simanek,
2726,/cache.aspx?q=yanmei+zhou+nature+communication...,
2727,/cache.aspx?q=yanmei+zhou+nature+communication...,
3854,/apis/site/proxy?url=https://www.brainpost.co/...,
3855,/apis/site/proxy?url=https://www.brainpost.co/...,
5068,/kayla-simanek,
5069,/kayla-simanek,
6120,/kayla-simanek,
6121,/kayla-simanek,


Only a few pages do not belong to any section. Let's label them ```other```.

In [35]:
content_data_df['section'] = content_data_df['section'].replace([np.nan], 'other')

In [37]:
# check whether any section is still null
nosection = content_data_df['section'].isnull() 
content_data_df[nosection][['page','section']]

Unnamed: 0,page,section


Let's tidy up the translation dataframe and save it to a file.

In [24]:
translation_df = tidy.extend_data_with_info_from_page(translation_df)
translation_df = tidy.tidy_weekly_brainpost(translation_df)
translation_df = tidy.tidy_brainpost_life_hacks(translation_df)
translation_df = tidy.tidy_blog(translation_df)
translation_df = tidy.set_section_names(translation_df, remaining_section_names)
print(translation_df.head())
translation_df.to_csv(f'{DATA_DIR}/translation_df_for_analysis.csv', index=False)

  language                                               page  \
0       tr  /weekly-brainpost/2020/10/6/neuronal-computati...   
1       tr  /weekly-brainpost/2020/10/6/neuronal-computati...   
2       fa  /weekly-brainpost/2018/6/19/stress-hormones-se...   
3       fa  /weekly-brainpost/2018/6/19/stress-hormones-se...   
4       fr  /brainpost-life-hacks/2019/1/2/new-year-new-me...   

                                                path from_facebook  \
0  /weekly-brainpost/2020/10/6/neuronal-computati...          True   
1  /weekly-brainpost/2020/10/6/neuronal-computati...          True   
2  /weekly-brainpost/2018/6/19/stress-hormones-se...           NaN   
3  /weekly-brainpost/2018/6/19/stress-hormones-se...           NaN   
4  /brainpost-life-hacks/2019/1/2/new-year-new-me...           NaN   

                section   pub_date  \
0      weekly-brainpost  2020/10/6   
1      weekly-brainpost  2020/10/6   
2      weekly-brainpost  2018/6/19   
3      weekly-brainpost  2018/6/19  

For future analysis, let's save the extended dataframe to a new CSV file.

In [38]:
content_data_df.to_csv(f'{DATA_DIR}/ext_df_for_analysis.csv', index=False)

Now that the data is cleaned up, we can analyze it in the next notebook: [Data Analysis](data_analysis.ipynb)