# Data Analysis

In this notebook, we will analyse the page content data.
Let's start by loading the data.

In [1]:
import pandas as pd

In [2]:
DATA_DIR = 'data'
content_data = 'content_data.csv'
page_view_data = 'page_view_data.csv' 

In [3]:
content_data_df = pd.read_csv(f'{DATA_DIR}/{content_data}')

In [4]:
content_data_df.head()

Unnamed: 0,Page,Source / Medium,Date Range,Pageviews,Unique Pageviews,Avg. Time on Page,Entrances,Bounce Rate,% Exit,Page Value
0,/weekly-brainpost/tag/sharp-wave+ripples,google / organic,"Oct 25, 2020 - Oct 31, 2020",1,1,00:00:00,1,1.0,1.0,0
1,/weekly-brainpost/tag/sharp-wave+ripples,google / organic,"Oct 18, 2020 - Oct 24, 2020",0,0,00:00:00,0,0.0,0.0,0
2,/weekly-brainpost/tag/precentral+gyrus,google / organic,"Oct 25, 2020 - Oct 31, 2020",0,0,00:00:00,0,0.0,0.0,0
3,/weekly-brainpost/tag/precentral+gyrus,google / organic,"Oct 18, 2020 - Oct 24, 2020",1,1,00:00:00,1,1.0,1.0,0
4,/weekly-brainpost/tag/NMDA+blockers,google / organic,"Oct 25, 2020 - Oct 31, 2020",2,2,00:06:54,2,0.5,0.5,0


#### Understanding the data

- Page
    - The URL path of the page. It carries a lot of information about the page (section, date, tags); we just need to organize it.
- Source / Medium
    - Where the traffic came from, given as a source and a medium separated by a slash. For example, 'google / organic' means organic search traffic from Google.
- Date Range
    - The period the row covers, given as a start date and an end date separated by a hyphen.
- Pageviews
    - Number of page views.
- Unique Pageviews
    - Page views from unique users.
- Avg. Time on Page
    - The average time spent on the page, in the format HH:mm:ss.
- Entrances
    - Number of entrances to the site through this page.
- Bounce Rate
    - Bounce rate of the page. The sample rows show it as a fraction between 0 and 1.
- % Exit
    - Share of views of the page that ended in an exit. Despite the name, the sample rows show a fraction (0.5, 1.0) rather than a percentage.
- Page Value
    - Page value as reported in the export. It is 0 in all the sample rows above.
    
    

There is so much information, but we need to tidy up the data before we can see some useful insights.
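
Several of these columns pack structured values into a single string. As a minimal sketch of how they might be unpacked, the helpers below use only the standard library. The function names are hypothetical, and the formats are assumed from the five sample rows shown above (abbreviated English month names, a ' / ' separator, HH:mm:ss durations).

```python
from datetime import datetime


def parse_source_medium(value):
    """Split a value such as 'google / organic' into (source, medium)."""
    source, _, medium = value.partition(' / ')
    return source.strip(), medium.strip()


def parse_date_range(value):
    """Turn 'Oct 25, 2020 - Oct 31, 2020' into a (start, end) pair of dates."""
    start_text, end_text = value.split(' - ')
    fmt = '%b %d, %Y'  # assumes abbreviated English month names
    return (datetime.strptime(start_text, fmt).date(),
            datetime.strptime(end_text, fmt).date())


def time_on_page_seconds(value):
    """Convert an 'HH:mm:ss' string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(':'))
    return hours * 3600 + minutes * 60 + seconds
```

Each helper takes one cell value, so it could be applied column-wise, for example with `content_data_df['Avg. Time on Page'].apply(time_on_page_seconds)`.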

### Data Cleanup
Let's start cleaning up, beginning with the leftmost column:

1. Page
    - The Page column holds the page URL. Because the URLs are well structured, they tell us clearly which category a page belongs to, and some of them contain tag names too. So let's have a look at the structure.

In [5]:
page_vals = content_data_df['Page'].values

In [6]:
page_vals_separated = [page.replace('+','_').split('/') for page in page_vals]
section_counts = [len(page_info) for page_info in page_vals_separated]
unique_lengths = sorted(set(section_counts))  # sorted, because set order is not guaranteed
print(f'Unique lengths available: {unique_lengths}')

Unique lengths available: [2, 3, 4, 6, 7, 9, 13]
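
The unique lengths alone don't say how common each URL shape is. A small sketch with `collections.Counter` could tally the rows per segment count. The `sample_pages` list and the function name are hypothetical stand-ins; in the notebook the input would be `page_vals`.

```python
from collections import Counter

# Hypothetical sample standing in for content_data_df['Page'].values
sample_pages = [
    '/weekly-brainpost?offset=1603822820040',
    '/weekly-brainpost/tag/sharp-wave+ripples',
    '/weekly-brainpost/tag/precentral+gyrus',
    '/weekly-brainpost/tag/NMDA+blockers',
]


def segment_length_counts(pages):
    """Count how many pages split into each number of '/'-separated parts."""
    return Counter(len(page.replace('+', '_').split('/')) for page in pages)


length_counts = segment_length_counts(sample_pages)
for length, count in sorted(length_counts.items()):
    print(f'{length} segments: {count} rows')
```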


Hmm, so the URL strings split into up to 13 segments. Let's see an example for each of the different lengths.

In [7]:
for length in unique_lengths:
    print(f'- {[page_info for page_info in page_vals_separated if len(page_info) == length][0]}')


- ['', 'weekly-brainpost?offset=1603822820040']
- ['', 'weekly-brainpost', ')']
- ['', 'weekly-brainpost', 'tag', 'sharp-wave_ripples']
- ['', 'weekly-brainpost', '2020', '9', '8', 'the-role-of-the-medial-prefrontal-in-exploring-new-options?fbclid=IwAR1RxVmb6I5u9dTobrQUPZopN59P2DKA7vWMDIfQOT7crURaHVFjZPBmtyo']
- ['', 'weekly-brainpost', '2020', '4', '21', 'stopping-tau-in-its-tracks-investigating-the-spread-of-the-tau-protein-in-the-brain?rq=CRISPR', 'Cas9']
- ['', 'weekly-brainpost', '2020', '7', '14', 'exposure-to-peers-pro-diversity-attitude-increases-inclusion-and-reduces-the-achievement-gap?back=https:', '', 'www.google.com', 'search?client=safari&as_qdr=all&as_occt=any&safe=active&as_q=How_do_gropes_Norm_promo_inclusion&channel=aplab&source=a-app1&hl=en']
- ['', 'weekly-brainpost', '2020', '6', '30', 'converting-astrocytes-into-neurons-reverses-motor-deficits-in-a-model-of-parkinsons-diseasehttps:', '', 'www.brainpost.co', 'weekly-brainpost', '2020', '6', '30', 'converting-astroc

We see different types of URLs here. The longest one seems to contain a repeated URL. And, of course, the date in the post URLs is split by '/' into separate segments. Let's confirm each case.
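
One way to make the date structure explicit is a regular expression. The sketch below assumes post URLs follow the shape `/weekly-brainpost/<year>/<month>/<day>/<slug>` seen in the examples; the pattern and the function name are illustrative, not part of the notebook.

```python
import re
from datetime import date

# Assumed shape: /weekly-brainpost/<year>/<month>/<day>/<slug>
POST_PATTERN = re.compile(
    r'^/weekly-brainpost/(?P<year>\d{4})/(?P<month>\d{1,2})/(?P<day>\d{1,2})'
    r'/(?P<slug>[^/?]+)'
)


def parse_post_url(page):
    """Return (url_date, slug) for a post URL, or None for any other page."""
    match = POST_PATTERN.match(page)
    if match is None:
        return None
    url_date = date(int(match['year']), int(match['month']), int(match['day']))
    # The slug stops at the first '/' or '?', so query strings are left out
    return url_date, match['slug']
```

Tag pages and listing pages return `None`. The URL with the repeated address would still need separate handling, because its slug segment runs straight into the second URL.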

In [8]:
[page_info for page_info in page_vals_separated if len(page_info) == 9]

[['',
  'weekly-brainpost',
  '2020',
  '7',
  '14',
  'exposure-to-peers-pro-diversity-attitude-increases-inclusion-and-reduces-the-achievement-gap?back=https:',
  '',
  'www.google.com',
  'search?client=safari&as_qdr=all&as_occt=any&safe=active&as_q=How_do_gropes_Norm_promo_inclusion&channel=aplab&source=a-app1&hl=en'],
 ['',
  'weekly-brainpost',
  '2020',
  '7',
  '14',
  'exposure-to-peers-pro-diversity-attitude-increases-inclusion-and-reduces-the-achievement-gap?back=https:',
  '',
  'www.google.com',
  'search?client=safari&as_qdr=all&as_occt=any&safe=active&as_q=How_do_gropes_Norm_promo_inclusion&channel=aplab&source=a-app1&hl=en'],
 ['',
  'weekly-brainpost',
  '2018',
  '9',
  '4',
  'mapping-subjective-feelings-ez8bf?back=https:',
  '',
  'www.google.com',
  'search?client=safari&as_qdr=all&as_occt=any&safe=active&as_q=what_is_a_subjective_feeling&channel=aplab&source=a-app1&hl=en'],
 ['',
  'weekly-brainpost',
  '2018',
  '9',
  '4',
  'mapping-subjective-feelings-ez8bf?ba

Some URLs appear long because a back-link URL is attached to them. Some come from Google Translate. (This could be interesting information.)
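
Because the attached back-link sits in the query string, separating the path from the query before splitting on '/' keeps the segment counts stable. Here is a minimal sketch with the standard library's `urllib.parse`; the function name `split_page` is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def split_page(page):
    """Separate a page value into path segments and query parameters."""
    parts = urlsplit(page)
    # Only the path is split on '/'; empty segments are dropped, so the
    # counts differ from page_vals_separated, which keeps the leading ''
    segments = [seg for seg in parts.path.replace('+', '_').split('/') if seg]
    # parse_qs maps each parameter name to a list of its values
    params = parse_qs(parts.query)
    return segments, params
```

With this split, a post URL carrying a `back=` or `fbclid=` parameter yields the same five path segments as the plain post URL, and the parameter names remain available for later analysis.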

In [9]:
[page_info for page_info in page_vals_separated if len(page_info) == 6]

[['',
  'weekly-brainpost',
  '2020',
  '9',
  '8',
  'the-role-of-the-medial-prefrontal-in-exploring-new-options?fbclid=IwAR1RxVmb6I5u9dTobrQUPZopN59P2DKA7vWMDIfQOT7crURaHVFjZPBmtyo'],
 ['',
  'weekly-brainpost',
  '2020',
  '9',
  '8',
  'the-role-of-the-medial-prefrontal-in-exploring-new-options?fbclid=IwAR1RxVmb6I5u9dTobrQUPZopN59P2DKA7vWMDIfQOT7crURaHVFjZPBmtyo'],
 ['',
  'weekly-brainpost',
  '2020',
  '9',
  '8',
  'the-role-of-the-medial-prefrontal-in-exploring-new-options'],
 ['',
  'weekly-brainpost',
  '2020',
  '9',
  '8',
  'the-role-of-the-medial-prefrontal-in-exploring-new-options'],
 ['',
  'weekly-brainpost',
  '2020',
  '9',
  '8',
  'neuropeptides-and-astrocytes-regulate-adult-neural-stem-cell-activity'],
 ['',
  'weekly-brainpost',
  '2020',
  '9',
  '8',
  'neuropeptides-and-astrocytes-regulate-adult-neural-stem-cell-activity'],
 ['',
  'weekly-brainpost',
  '2020',
  '9',
  '29',
  'rem-sleep-is-necessary-for-experience-dependent-plasticity?fbclid=IwAR3Cye9Tz5Q4Xh

In [10]:
[page_info for page_info in page_vals_separated if len(page_info) == 4]

[['', 'weekly-brainpost', 'tag', 'sharp-wave_ripples'],
 ['', 'weekly-brainpost', 'tag', 'sharp-wave_ripples'],
 ['', 'weekly-brainpost', 'tag', 'precentral_gyrus'],
 ['', 'weekly-brainpost', 'tag', 'precentral_gyrus'],
 ['', 'weekly-brainpost', 'tag', 'NMDA_blockers'],
 ['', 'weekly-brainpost', 'tag', 'NMDA_blockers'],
 ['', 'weekly-brainpost', 'tag', 'dorsolateral_prefrontal_cortex'],
 ['', 'weekly-brainpost', 'tag', 'dorsolateral_prefrontal_cortex'],
 ['', 'weekly-brainpost', 'tag', 'songbirds'],
 ['', 'weekly-brainpost', 'tag', 'songbirds'],
 ['', 'weekly-brainpost', 'tag', 'orbitofrontal_cortex'],
 ['', 'weekly-brainpost', 'tag', 'orbitofrontal_cortex'],
 ['', 'weekly-brainpost', 'tag', 'lateral_habenula'],
 ['', 'weekly-brainpost', 'tag', 'lateral_habenula'],
 ['', 'weekly-brainpost', 'tag', 'lateral_habenula'],
 ['', 'weekly-brainpost', 'tag', 'lateral_habenula'],
 ['', 'weekly-brainpost', 'tag', 'dorsolateral_prefrontal_cortex'],
 ['', 'weekly-brainpost', 'tag', 'dorsolateral_p

In [11]:
[page_info for page_info in page_vals_separated if len(page_info) == 3]

[['', 'weekly-brainpost', ')'],
 ['', 'weekly-brainpost', ')'],
 ['', 'weekly-brainpost', ''],
 ['', 'weekly-brainpost', ''],
 ['',
  'checkout',
  'donate?donatePageId=5af34d5e2b6a28770569e501&ss_cid=c9bdfba3-7b16-4f53-b153-f51f46563ce6&ss_cvisit=1603211959498&ss_cvr=796dd6a6-e508-4fa3-bfcf-2af576396129|1603211959317|1603211959317|1603211959317|1'],
 ['',
  'checkout',
  'donate?donatePageId=5af34d5e2b6a28770569e501&ss_cid=c9bdfba3-7b16-4f53-b153-f51f46563ce6&ss_cvisit=1603211959498&ss_cvr=796dd6a6-e508-4fa3-bfcf-2af576396129|1603211959317|1603211959317|1603211959317|1'],
 ['',
  'checkout',
  'donate?donatePageId=5af34d5e2b6a28770569e501&ss_cid=5efd2907-3042-466c-b715-717b925ef11f&ss_cvisit=1603749137823&ss_cvr=b8097e66-931a-4a00-8609-14c9739be6fc|1603749137740|1603749137740|1603749137740|1'],
 ['',
  'checkout',
  'donate?donatePageId=5af34d5e2b6a28770569e501&ss_cid=5efd2907-3042-466c-b715-717b925ef11f&ss_cvisit=1603749137823&ss_cvr=b8097e66-931a-4a00-8609-14c9739be6fc|1603749137740

In [12]:
[page_info for page_info in page_vals_separated if len(page_info) == 2]

[['', 'weekly-brainpost?offset=1603822820040'],
 ['', 'weekly-brainpost?offset=1603822820040'],
 ['', 'weekly-brainpost?offset=1603822820040'],
 ['', 'weekly-brainpost?offset=1603822820040'],
 ['', 'weekly-brainpost?offset=1603208522960'],
 ['', 'weekly-brainpost?offset=1603208522960'],
 ['', 'weekly-brainpost?offset=1603208522960'],
 ['', 'weekly-brainpost?offset=1603208522960'],
 ['', 'weekly-brainpost?offset=1602617645569'],
 ['', 'weekly-brainpost?offset=1602617645569'],
 ['', 'weekly-brainpost?offset=1602617645569'],
 ['', 'weekly-brainpost?offset=1602617645569'],
 ['', 'weekly-brainpost?offset=1602014487556'],
 ['', 'weekly-brainpost?offset=1602014487556'],
 ['', 'weekly-brainpost?offset=1602014487556'],
 ['', 'weekly-brainpost?offset=1602014487556'],
 ['', 'weekly-brainpost?offset=1601397513565'],
 ['', 'weekly-brainpost?offset=1601397513565'],
 ['', 'weekly-brainpost?offset=1600803957122'],
 ['', 'weekly-brainpost?offset=1600803957122'],
 ['', 'weekly-brainpost?offset=160019937

### Overall observations on pages
From these observations, here are some rules we can apply:
- First of all, let's find the links to the home page and name them 'home'. These are either just '/' or '/home', possibly followed by a query string.

In [13]:
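def generate_page_name(page_original):
    """Return (page_name, page_from) for a page path; 'NONE' means no match."""
    page_name = 'NONE'
    page_from = 'NONE'
    if page_original in ['/', '/home']:
        # Plain home page without a query string
        page_name = 'home'
    elif '/?' in page_original:
        # Home page with a query string: the first parameter name hints at the origin
        page_name = 'home'
        query = page_original.split('?', 1)[1]
        # Text before the first '='; also works when the query has no '='
        page_from = query.split('=', 1)[0]
    return page_name, page_from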
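
Printing one parameter name per row gets long quickly. As an alternative sketch, the first query parameter name can be extracted with `urllib.parse` and tallied with `collections.Counter`. The helper name and the sample values below are hypothetical, and the token values are placeholders rather than real data.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def first_query_param(page):
    """Return the name of the first query parameter, or 'NONE' if absent."""
    query = urlsplit(page).query
    # keep_blank_values=True keeps parameters that have no '=value' part
    pairs = parse_qsl(query, keep_blank_values=True)
    return pairs[0][0] if pairs else 'NONE'


# Hypothetical home-page values; the tokens are placeholders
sample_pages = ['/', '/home', '/?fbclid=abc', '/?fbclid=def', '/?amp', '/?offset=1']
param_counts = Counter(first_query_param(page) for page in sample_pages)
for name, count in param_counts.most_common():
    print(f'{name}: {count}')
```

Unlike `generate_page_name`, this helper does not check that the page is the home page, so it would need to be combined with that check before use on the full column.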

In [14]:
for page_val in page_vals:
    pname, pfrom = generate_page_name(page_val)
    if pfrom != 'NONE':
        print(pfrom)

sqsscreenshot
sqsscreenshot
sqsscreenshot
sqsscreenshot
fbclid
fbclid
fbclid
fbclid
fbclid
fbclid
offset
offset
sqsscreenshot
sqsscreenshot
sqsscreenshot
sqsscreenshot
sqsscreenshot
sqsscreenshot
sqsscreenshot
sqsscreenshot
sqsscreenshot
sqsscreenshot
sqsscreenshot
sqsscreenshot
sqsscreenshot
sqsscreenshot
fbclid
fbclid
fbclid
fbclid
fbclid
fbclid
ss_source
ss_source
sqsscreenshot
sqsscreenshot
sqsscreenshot
sqsscreenshot
sqsscreenshot
sqsscreenshot
fbclid
fbclid
fbclid
fbclid
fbclid
fbclid
ss_source
ss_source
sqsscreenshot
sqsscreenshot
sqsscreenshot
sqsscreenshot
sqsscreenshot
sqsscreenshot
sqsscreenshot
sqsscreenshot
sqsscreenshot
sqsscreenshot
sqsscreenshot
sqsscreenshot
fbclid
fbclid
fbclid
fbclid
fbclid
fbclid
offset
offset
ss_source
ss_source
ss_source
ss_source
ss_source
ss_source
ss_source
ss_source
sqsscreenshot
sqsscreenshot
sqsscreenshot
sqsscreenshot
fbclid
fbclid
fbclid
fbclid
fbclid
fbclid
fbclid
fbclid
fbclid
fbclid
fbclid
fbclid
amp
amp
offset
offset
sqsscreenshot
sqss