## Analyzing Website Data using Pandas

Now that we understand how to get data from a website using BeautifulSoup, let us work through a few scenarios to validate that data using Pandas.
* Create a data frame using `url_and_content_list`.
* Get the length of the content for each URL.
* Get the word count for each URL.
* Get the list of pages with 30 words or fewer.
* Get the unique word count for each URL.
* Get the number of times each word is repeated across all URLs.

In [2]:
%run 09_extracting_content_from_website.ipynb

CPU times: total: 23.3 s
Wall time: 1min 33s
https://python.itversity.com/01_overview_of_windows_os/01_overview_of_windows_os.html : 233
https://python.itversity.com/01_overview_of_windows_os/02_getting_system_details.html : 463
https://python.itversity.com/01_overview_of_windows_os/03_managing_windows_system.html : 475
https://python.itversity.com/01_overview_of_windows_os/04_overview_of_microsoft_office.html : 573
https://python.itversity.com/01_overview_of_windows_os/05_overview_of_editors_and_ides.html : 39
https://python.itversity.com/01_overview_of_windows_os/06_power_shell_and_command_prompt.html : 41
https://python.itversity.com/01_overview_of_windows_os/07_connecting_to_linux_servers.html : 38
https://python.itversity.com/01_overview_of_windows_os/08_folders_and_files.html : 28
https://python.itversity.com/02_setup_ubuntu_vm_on_gcp/01_setup_ubuntu_vm_on_gcp.html : 323
https://python.itversity.com/02_setup_ubuntu_vm_on_gcp/02_signing_up_for_gcp.html : 465


* Create a data frame using `url_and_content_list`

In [3]:
import pandas as pd

url_and_content_df = pd.DataFrame(url_and_content_list, columns=['url', 'content'])

In [4]:
url_and_content_df.head()

Unnamed: 0,url,content
0,https://python.itversity.com/01_overview_of_wi...,\n\n\n\nOverview of Windows Operating System¶\...
1,https://python.itversity.com/01_overview_of_wi...,\n\n\n\nGetting System Details¶\nLet us unders...
2,https://python.itversity.com/01_overview_of_wi...,\n\n\n\nManaging Windows System¶\nLet us under...
3,https://python.itversity.com/01_overview_of_wi...,\n\n\n\nOverview of Microsoft Office¶\nAs IT P...
4,https://python.itversity.com/01_overview_of_wi...,\n\n\n\nOverview of Editors and IDEs¶\n\n\n\n\n\n

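The constructor call above relies on each element of `url_and_content_list` having two positions that map to the `url` and `content` columns. Here is a minimal sketch of that mapping, assuming the list holds `(url, content)` tuples. The name `sample_list` and the example URLs are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for url_and_content_list: a list of (url, content) tuples.
sample_list = [
    ('https://example.com/a.html', '\n\nPage A\nLet us start'),
    ('https://example.com/b.html', '\n\nPage B\n'),
]

# Each tuple becomes one row; `columns` names the tuple positions in order.
sample_df = pd.DataFrame(sample_list, columns=['url', 'content'])

print(sample_df.shape)
print(list(sample_df.columns))
```

If the list held dicts instead of tuples, the `columns` argument would select keys rather than name positions.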

* Get the length of the content for each URL

In [5]:
url_and_content_df['content_length'] = url_and_content_df['content'].str.len()

In [6]:
url_and_content_df

Unnamed: 0,url,content,content_length
0,https://python.itversity.com/01_overview_of_wi...,\n\n\n\nOverview of Windows Operating System¶\...,233
1,https://python.itversity.com/01_overview_of_wi...,\n\n\n\nGetting System Details¶\nLet us unders...,463
2,https://python.itversity.com/01_overview_of_wi...,\n\n\n\nManaging Windows System¶\nLet us under...,475
3,https://python.itversity.com/01_overview_of_wi...,\n\n\n\nOverview of Microsoft Office¶\nAs IT P...,573
4,https://python.itversity.com/01_overview_of_wi...,\n\n\n\nOverview of Editors and IDEs¶\n\n\n\n\n\n,39
...,...,...,...
182,https://python.itversity.com/18_database_progr...,\n\n\n\nRecap of Insert¶\nLet us recap about I...,3092
183,https://python.itversity.com/18_database_progr...,\n\n\n\nPreparing Database¶\nLet us setup the ...,1614
184,https://python.itversity.com/18_database_progr...,\n\n\n\nReading Data from File¶\nLet us read b...,2544
185,https://python.itversity.com/18_database_progr...,\n\n\n\nBatch Loading of Data¶\nLet us underst...,4431

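Keep in mind that `str.len()` counts every character, including the newlines that surround the scraped text. That is why a page holding only a heading still shows a length such as 39. The sketch below uses made-up strings (not the scraped data) to compare the raw length with the length after `str.strip()`.

```python
import pandas as pd

# Made-up content values; the last one is missing on purpose.
contents = pd.Series(['\n\nMethods\n', 'Let us start', None])

# Raw length counts newlines as characters; missing values stay missing.
raw_length = contents.str.len()

# Stripping leading and trailing whitespace first gives the visible text length.
stripped_length = contents.str.strip().str.len()

print(raw_length.tolist())
print(stripped_length.tolist())
```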

* Get the word count for each URL

In [7]:
url_and_content_df['word_count'] = url_and_content_df['content'].str.split(' ').str.len()

In [8]:
url_and_content_df.sort_values('word_count')

Unnamed: 0,url,content,content_length,word_count
141,https://python.itversity.com/14_overview_of_ob...,\n\n\n\nInheritance¶\n\n\n\n\n\n,22,1
139,https://python.itversity.com/14_overview_of_ob...,\n\n\n\nConstructors¶\n\n\n\n\n\n,23,1
140,https://python.itversity.com/14_overview_of_ob...,\n\n\n\nMethods¶\n\n\n\n\n\n,18,1
142,https://python.itversity.com/14_overview_of_ob...,\n\n\n\nEncapsulation¶\n\n\n\n\n\n,24,1
143,https://python.itversity.com/14_overview_of_ob...,\n\n\n\nPolymorphism¶\n\n\n\n\n\n,23,1
...,...,...,...,...
132,https://python.itversity.com/13_understanding_...,\n\n\n\nRow level transformations using map¶\n...,23527,2105
60,https://python.itversity.com/07_pre_defined_fu...,\n\n\n\nString Manipulation Functions¶\nLet us...,15587,3724
149,https://python.itversity.com/15_overview_of_pa...,\n\n\n\nData Frames - Basic Operations¶\nHere ...,18321,3752
68,https://python.itversity.com/08_user_defined_f...,\n\n\n\nDoc Strings¶\nDocumentation is one of ...,16342,3936

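Splitting on a single space treats newlines as part of a word and turns consecutive spaces into empty strings, so the counts above are approximate. One possible alternative, shown here on made-up strings, is to call `str.split()` without a separator, which splits on any run of whitespace.

```python
import pandas as pd

# Made-up content values that mimic the scraped text.
contents = pd.Series([
    '\n\n\n\nInheritance¶\n\n\n\n\n\n',
    'Let us  understand\nhow\nto get',
    '\n\n',
])

# Split on a single space only, as in the cell above.
by_space = contents.str.split(' ').str.len()

# Split on any run of whitespace (spaces, tabs, newlines).
by_whitespace = contents.str.split().str.len()

print(by_space.tolist())       # [1, 5, 1]
print(by_whitespace.tolist())  # [1, 6, 0]
```

With the whitespace-based split, a page that contains only newlines counts as 0 words instead of 1.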

* Get the list of pages with 30 words or fewer (`word_count <= 30`)

In [9]:
for url in url_and_content_df.query('word_count <= 30')['url']:
    print(url)

https://python.itversity.com/01_overview_of_windows_os/01_overview_of_windows_os.html
https://python.itversity.com/01_overview_of_windows_os/05_overview_of_editors_and_ides.html
https://python.itversity.com/01_overview_of_windows_os/06_power_shell_and_command_prompt.html
https://python.itversity.com/01_overview_of_windows_os/07_connecting_to_linux_servers.html
https://python.itversity.com/01_overview_of_windows_os/08_folders_and_files.html
https://python.itversity.com/07_pre_defined_functions/01_pre_defined_functions.html
https://python.itversity.com/12_development_of_map_reduce_apis/01_development_of_map_reduce_apis.html
https://python.itversity.com/14_overview_of_object_oriented_programming/01_overview_of_object_oriented_programming.html
https://python.itversity.com/14_overview_of_object_oriented_programming/02_classes_and_objects.html
https://python.itversity.com/14_overview_of_object_oriented_programming/03_constructors.html
https://python.itversity.com/14_overview_of_object_orient
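
Note that `<= 30` also includes pages with exactly 30 words; a strict "less than 30" filter would use `<`. The sketch below uses made-up data to show the difference, along with the boolean-mask form that is equivalent to `query`.

```python
import pandas as pd

# Made-up pages with word counts below, at, and above the threshold.
pages = pd.DataFrame({
    'url': ['a.html', 'b.html', 'c.html'],
    'word_count': [12, 30, 45],
})

# query() evaluates the expression against column names.
at_most_30 = pages.query('word_count <= 30')['url'].tolist()

# The same kind of filter written as a boolean mask, with a strict comparison.
fewer_than_30 = pages.loc[pages['word_count'] < 30, 'url'].tolist()

print(at_most_30)     # ['a.html', 'b.html']
print(fewer_than_30)  # ['a.html']
```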

* Get the unique word count for each URL

In [10]:
def get_unique_count(words):
    return len(set(words))

In [11]:
# Split each page into words, then count the distinct words per page.
url_and_content_df['unique_word_count'] = url_and_content_df['content'].str.split(' ').apply(get_unique_count)

In [12]:
url_and_content_df

Unnamed: 0,url,content,content_length,word_count,unique_word_count
0,https://python.itversity.com/01_overview_of_wi...,\n\n\n\nOverview of Windows Operating System¶\...,233,25,20
1,https://python.itversity.com/01_overview_of_wi...,\n\n\n\nGetting System Details¶\nLet us unders...,463,73,56
2,https://python.itversity.com/01_overview_of_wi...,\n\n\n\nManaging Windows System¶\nLet us under...,475,65,56
3,https://python.itversity.com/01_overview_of_wi...,\n\n\n\nOverview of Microsoft Office¶\nAs IT P...,573,79,53
4,https://python.itversity.com/01_overview_of_wi...,\n\n\n\nOverview of Editors and IDEs¶\n\n\n\n\n\n,39,5,5
...,...,...,...,...,...
182,https://python.itversity.com/18_database_progr...,\n\n\n\nRecap of Insert¶\nLet us recap about I...,3092,261,130
183,https://python.itversity.com/18_database_progr...,\n\n\n\nPreparing Database¶\nLet us setup the ...,1614,144,99
184,https://python.itversity.com/18_database_progr...,\n\n\n\nReading Data from File¶\nLet us read b...,2544,223,126
185,https://python.itversity.com/18_database_progr...,\n\n\n\nBatch Loading of Data¶\nLet us underst...,4431,502,242

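The unique counts above treat `System`, `system` and `System,` as three different words, because the comparison is case-sensitive and punctuation stays attached. If that matters for the validation, one option is to normalize the text first. The sketch below is only an illustration on made-up strings; the regular expression `[a-z0-9_]+` is an assumption about what should count as a word.

```python
import pandas as pd

# Made-up content values.
contents = pd.Series(['System system System, details', 'Let us us understand'])

# Unique count on raw whitespace-separated tokens.
raw_unique = contents.str.split().apply(lambda words: len(set(words)))

# Lowercase first, then keep only runs of letters, digits and underscores.
normalized_unique = (
    contents.str.lower()
    .str.findall(r'[a-z0-9_]+')
    .apply(lambda words: len(set(words)))
)

print(raw_unique.tolist())         # [4, 3]
print(normalized_unique.tolist())  # [2, 3]
```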

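* Get the number of times each word is repeated. Note that the cells below count each word across all URLs combined, because `reset_index(drop=True)` discards the row labels that identify the page.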

In [13]:
words = url_and_content_df['content'].str.split(' ', expand=True)
words

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,12467,12468,12469,12470,12471,12472,12473,12474,12475,12476
0,\n\n\n\nOverview,of,Windows,Operating,System¶\n\nGetting,System,Details\nManaging,Windows,System\nOverview,of,...,,,,,,,,,,
1,\n\n\n\nGetting,System,Details¶\nLet,us,understand,how,to,get,System,Details,...,,,,,,,,,,
2,\n\n\n\nManaging,Windows,System¶\nLet,us,understand,how,to,manage,system,effectively.,...,,,,,,,,,,
3,\n\n\n\nOverview,of,Microsoft,Office¶\nAs,IT,"Professionals,",it,is,important,to,...,,,,,,,,,,
4,\n\n\n\nOverview,of,Editors,and,IDEs¶\n\n\n\n\n\n,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
182,\n\n\n\nRecap,of,Insert¶\nLet,us,recap,about,INSERT,statement,as,we,...,,,,,,,,,,
183,\n\n\n\nPreparing,Database¶\nLet,us,setup,the,database,along,with,tables,to,...,,,,,,,,,,
184,\n\n\n\nReading,Data,from,File¶\nLet,us,read,both,orders,as,well,...,,,,,,,,,,
185,\n\n\n\nBatch,Loading,of,Data¶\nLet,us,understand,how,we,should,take,...,,,,,,,,,,


In [14]:
s = words.stack().reset_index(drop=True)

In [17]:
s

0               \n\n\n\nOverview
1                             of
2                        Windows
3                      Operating
4             System¶\n\nGetting
                  ...           
82174                         we
82175                        can
82176                        use
82177                      those
82178    features.\n\n\n\n\n\n\n
Length: 82179, dtype: object

In [20]:
words.unstack()

0      0      \n\n\n\nOverview
       1       \n\n\n\nGetting
       2      \n\n\n\nManaging
       3      \n\n\n\nOverview
       4      \n\n\n\nOverview
                    ...       
12476  182                None
       183                None
       184                None
       185                None
       186                None
Length: 2333199, dtype: object
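
The difference between 82179 rows after `stack` and 2333199 rows after `unstack` comes from the padding that `str.split(' ', expand=True)` adds: every row is filled with missing values up to the width of the longest page. In the run above `stack` left that padding out, while `unstack` kept every cell. Newer pandas versions may keep missing values in `stack` as well, so the sketch below (on made-up strings) calls `dropna()` explicitly rather than relying on the default.

```python
import pandas as pd

# Made-up content values of different lengths.
contents = pd.Series(['a b c', 'd'])

# 2 rows x 3 columns; the shorter row is padded with missing values.
words = contents.str.split(' ', expand=True)

# Row by row, one word per entry, with the padding removed explicitly.
stacked = words.stack().dropna().reset_index(drop=True)

# Column by column, one entry per cell, padding included.
unstacked = words.unstack()

print(stacked.tolist())  # ['a', 'b', 'c', 'd']
print(len(unstacked))    # 6
```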

In [18]:
type(s)

pandas.core.series.Series

In [15]:
word_count = s.groupby(s).agg(['count'])

In [16]:
word_count.query('count >= 50 and count <= 100')

Unnamed: 0,count
"'Friday',",55
"'Monday',",52
"'Saturday',",55
"'Sunday',",51
"'Thursday',",66
...,...
typically,54
understand,76
used,62
value,79


In [22]:
word_count

Unnamed: 0,count
,24417
\n,344
\n\n,1
\n\n\n\n\n\n\n#,3
\n\n\n\n\n\n\n%sql,1
...,...
“state”:,1
“uniq_id”:,1
“zip”:,1
”,1
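
The counts above cover all pages combined, and the most frequent entry is the empty string produced by splitting on a single space. A per-URL breakdown could be sketched as follows, assuming a whitespace-based split and `explode` to get one row per word. The data frame `pages` and its contents are made up for illustration.

```python
import pandas as pd

# Made-up pages; the real data frame has the same two columns.
pages = pd.DataFrame({
    'url': ['a.html', 'b.html'],
    'content': ['let us let\nus go', 'go go'],
})

# One row per (url, word): split on whitespace, then explode the lists.
exploded = pages.assign(word=pages['content'].str.split()).explode('word')

# Number of times each word appears within each URL.
per_url_counts = (
    exploded.groupby(['url', 'word']).size().reset_index(name='count')
)

# Number of times each word appears across all URLs.
overall_counts = exploded['word'].value_counts()

print(per_url_counts)
print(overall_counts)
```

Keeping the `url` column next to each word is what makes the per-page grouping possible; dropping the index, as in the cells above, leaves only the overall counts.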
