# Explore Newsgroups with Regexes
The machine learning library scikit-learn provides a dataset of several thousand posts from [old internet newsgroups][1]. 400 of these posts are stored in the `newsgroups.csv` file in the data directory. This is a great dataset for practicing regular expressions.

## Read in data
There are just two columns, one for the category and the other for the text of the post.

[1]: http://scikit-learn.org/stable/datasets/twenty_newsgroups.html#newsgroups

In [1]:
import pandas as pd
news = pd.read_csv('../data/newsgroups.csv')
news.head()

Unnamed: 0,category,text
0,sci.med,From: nyeda@cnsvax.uwec.edu (David Nye)\nSubje...
1,talk.politics.guns,From: ndallen@r-node.hub.org (Nigel Allen)\nSu...
2,misc.forsale,From: mark@ardsley.business.uwo.ca (Mark Bramw...
3,misc.forsale,From: zmed16@trc.amoco.com (Michael)\nSubject:...
4,talk.politics.guns,From: fcrary@ucsu.Colorado.EDU (Frank Crary)\n...


The original dataset has 20 newsgroups, which are labeled as categories here. This sample contains posts from six of them.

In [2]:
news['category'].value_counts()

misc.forsale          73
rec.autos             72
sci.space             65
talk.politics.guns    64
sci.med               63
rec.sport.baseball    63
Name: category, dtype: int64

In [3]:
news['text'].head()

0    From: nyeda@cnsvax.uwec.edu (David Nye)\nSubje...
1    From: ndallen@r-node.hub.org (Nigel Allen)\nSu...
2    From: mark@ardsley.business.uwo.ca (Mark Bramw...
3    From: zmed16@trc.amoco.com (Michael)\nSubject:...
4    From: fcrary@ucsu.Colorado.EDU (Frank Crary)\n...
Name: text, dtype: object

Output an entire post with the `print` function.

In [4]:
print(news['text'].values[0])

From: nyeda@cnsvax.uwec.edu (David Nye)
Subject: Re: Post Polio Syndrome Information Needed Please !!!
Organization: University of Wisconsin Eau Claire
Lines: 21

[reply to keith@actrix.gen.nz (Keith Stewart)]
 
>My wife has become interested through an acquaintance in Post-Polio
>Syndrome This apparently is not recognised in New Zealand and different
>symptons ( eg chest complaints) are treated separately. Does anone have
>any information on it
 
It would help if you (and anyone else asking for medical information on
some subject) could ask specific questions, as no one is likely to type
in a textbook chapter covering all aspects of the subject.  If you are
looking for a comprehensive review, ask your local hospital librarian.
Most are happy to help with a request of this sort.
 
Briefly, this is a condition in which patients who have significant
residual weakness from childhood polio notice progression of the
weakness as they get older.  One theory is that the remaining motor
neurons

# Can you do the following?
* Extract all email addresses
* Distinguish the header from the text body
* Determine if there is a quote in the message (like there is above)
* Find the most frequent words for each category
* Come up with your own questions and answer them

## Solutions are below

# Solutions

## Extracting emails
It appears that every post has a header line beginning with 'From:' that contains the sender's email address. The following pattern captures the address from that line: one or more characters that are not a space, parenthesis, or less-than sign, then an at symbol, then one or more characters that are not a space, opening parenthesis, greater-than sign, or line break.

In [5]:
pattern = r'\bFrom:.*?([^ ()<]+@[^ (>\n]+)'
emails = news['text'].str.extract(pattern)
emails.head()

Unnamed: 0,0
0,nyeda@cnsvax.uwec.edu
1,ndallen@r-node.hub.org
2,mark@ardsley.business.uwo.ca
3,zmed16@trc.amoco.com
4,fcrary@ucsu.Colorado.EDU
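
To see why the character classes exclude parentheses and angle brackets, here is a minimal sketch that applies the same pattern with the standard `re` module. The two sample headers are made up for illustration; they mimic the `address (Name)` and `Name <address>` layouts.

```python
import re

# Same pattern as in the cell above; the parentheses capture only the address.
pattern = r'\bFrom:.*?([^ ()<]+@[^ (>\n]+)'

# Hypothetical header snippets in two common "From:" layouts.
samples = [
    'From: jane@example.com (Jane Doe)\nSubject: test',
    'From: John Roe <john@example.org>\nSubject: test',
]

# re.search returns the first match, which is what str.extract captures.
found = [re.search(pattern, text).group(1) for text in samples]
print(found)
```

In both layouts the capture group holds just the address, without the display name or the surrounding brackets.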


## Extracting the header
It appears that the header begins at the start of the post and continues until the first empty line. The following pattern matches all characters (including line breaks) up to the first occurrence of two line breaks in a row. This should represent the header. The character class `[\s\S]` matches any character. The dot special character cannot be used here because, by default, it does not match line breaks.

The `*?` is a non-greedy quantifier, meaning it matches as few characters as possible, so the match stops at the first pair of line breaks. If the question mark were absent, the match would continue until the last two line breaks in a row. That's called **greedy**.

In [6]:
headers = news['text'].str.extract(r'([\s\S]*?)\n\n')
headers.head()

Unnamed: 0,0
0,From: nyeda@cnsvax.uwec.edu (David Nye)\nSubje...
1,From: ndallen@r-node.hub.org (Nigel Allen)\nSu...
2,From: mark@ardsley.business.uwo.ca (Mark Bramw...
3,From: zmed16@trc.amoco.com (Michael)\nSubject:...
4,From: fcrary@ucsu.Colorado.EDU (Frank Crary)\n...
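
The difference between the greedy and non-greedy versions is easiest to see on a tiny example. The sketch below uses the standard `re` module and a made-up post whose body contains a blank line of its own.

```python
import re

# Hypothetical post: two header lines, a blank line, then a body
# that itself contains a blank line.
post = 'From: a@b.com\nLines: 2\n\nfirst paragraph\n\nsecond paragraph'

# Non-greedy: stops at the first pair of line breaks.
lazy = re.search(r'([\s\S]*?)\n\n', post).group(1)

# Greedy: runs on to the last pair of line breaks.
greedy = re.search(r'([\s\S]*)\n\n', post).group(1)

print(repr(lazy))
print(repr(greedy))
```

Only the non-greedy version isolates the header; the greedy one also swallows the first paragraph of the body.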


### Example header

In [7]:
print(headers.loc[100, 0])

From: c23reg@kocrsv01.delcoelect.com (Ron Gaskins)
Subject: Re: Dumbest automotive concepts of all tim
Originator: c23reg@koptsw21
Keywords: Dimmer switch location (repost)
Organization: Delco Electronics Corp.
Lines: 22


## Finding posts with quotes
The assumption here is that a quoted line begins with a greater-than symbol. The pattern looks for a line break immediately followed by `>`.

In [8]:
filt = news['text'].str.contains(r'\n>')
posts_with_quotes = news.loc[filt, 'text']
print(posts_with_quotes.values[0])

From: nyeda@cnsvax.uwec.edu (David Nye)
Subject: Re: Post Polio Syndrome Information Needed Please !!!
Organization: University of Wisconsin Eau Claire
Lines: 21

[reply to keith@actrix.gen.nz (Keith Stewart)]
 
>My wife has become interested through an acquaintance in Post-Polio
>Syndrome This apparently is not recognised in New Zealand and different
>symptons ( eg chest complaints) are treated separately. Does anone have
>any information on it
 
It would help if you (and anyone else asking for medical information on
some subject) could ask specific questions, as no one is likely to type
in a textbook chapter covering all aspects of the subject.  If you are
looking for a comprehensive review, ask your local hospital librarian.
Most are happy to help with a request of this sort.
 
Briefly, this is a condition in which patients who have significant
residual weakness from childhood polio notice progression of the
weakness as they get older.  One theory is that the remaining motor
neurons
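
One caveat of the `\n>` pattern is that it cannot see a quote on the very first line of a string, because no line break comes before it. Every full post starts with a header, so this does not matter here, but it would if only the body were searched. The sketch below compares it with a hypothetical alternative that anchors on line starts with the `re.MULTILINE` flag; the sample texts are made up.

```python
import re

# Hypothetical texts: a normal reply, a post with '>' in mid-sentence,
# and a text whose first line is already a quote.
texts = {
    'reply': 'Lines: 3\n\n>quoted line\nmy answer',
    'no_quote': 'Lines: 2\n\nprice > 100 dollars',
    'starts_with_quote': '>quoted first line\nmy answer',
}

# Pattern from the cell above: a line break followed by '>'.
newline_based = {k: bool(re.search(r'\n>', t)) for k, t in texts.items()}

# Alternative: '^' matches at the start of every line with MULTILINE.
anchor_based = {k: bool(re.search(r'^>', t, flags=re.MULTILINE))
                for k, t in texts.items()}

print(newline_based)
print(anchor_based)
```

Both versions ignore a `>` in the middle of a line; they differ only for the text that begins with a quote.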

## Counting words per category
We first put the category into the index and extract just the body of each post (everything after the header). `str.extract` returns a DataFrame with a single column named 0, which we select in the second line.

In [9]:
body = news.set_index('category')['text'].str.extract(r'[\s\S]*?\n\n([\s\S]+)')
body_series = body[0]
body_series.head()

category
sci.med               [reply to keith@actrix.gen.nz (Keith Stewart)]...
talk.politics.guns    Here is a press release from the White House.\...
misc.forsale          >\n>I hope you realize that for a cellular pho...
misc.forsale          \nI have an Alesis HR-16 drum machine for sale...
talk.politics.guns    In article <C4tsHu.Ew6@magpie.linknet.com> man...
Name: 0, dtype: object
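
As a small variation, the header and the body can be captured in one pass. This is a minimal sketch on a made-up post, assuming pandas is installed; the group names `header` and `body` are chosen here for illustration. When a pattern uses named groups, `str.extract` uses those names as column labels.

```python
import pandas as pd

# Hypothetical one-post Series standing in for news['text'].
posts = pd.Series(
    ['From: a@b.com\nLines: 2\n\nfirst paragraph\n\nsecond paragraph']
)

# Two named groups: everything before the first blank line,
# and everything after it.
parts = posts.str.extract(r'(?P<header>[\s\S]*?)\n\n(?P<body>[\s\S]+)')

print(list(parts.columns))
print(repr(parts.loc[0, 'body']))
```

With named columns there is no need to select a column called 0 afterwards.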

### Extract each individual non-quote line
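We then use `extractall` to capture each individual line. The assumption we make is that a non-quote line begins with a word character, which excludes lines starting with `>`. Because the pattern requires a preceding line break, the very first line of each body is not captured.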

In [10]:
body_lines = body_series.str.extractall(r'[\n]+(\w.*)')
body_lines.head(20)

Unnamed: 0_level_0,Unnamed: 1_level_0,0
category,match,Unnamed: 2_level_1
sci.med,0,It would help if you (and anyone else asking f...
sci.med,1,"some subject) could ask specific questions, as..."
sci.med,2,in a textbook chapter covering all aspects of ...
sci.med,3,"looking for a comprehensive review, ask your l..."
sci.med,4,Most are happy to help with a request of this ...
sci.med,5,"Briefly, this is a condition in which patients..."
sci.med,6,residual weakness from childhood polio notice ...
sci.med,7,weakness as they get older. One theory is tha...
sci.med,8,neurons have to work harder and so die sooner.
sci.med,9,David Nye (nyeda@cnsvax.uwec.edu). Midelfort ...
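
The effect of requiring a line break before each line can be checked on a short made-up body. The sketch below uses `re.findall`, which returns the captured group for every match, much like `extractall` does per row. The second pattern is a hypothetical alternative, not the one used in this notebook.

```python
import re

# Hypothetical body: a plain first line, a quoted line, and two more lines.
body = 'First line of the body\n>quoted line\nsecond line\n\nthird line'

# Pattern from the cell above: one or more line breaks, then a line
# that starts with a word character.
after_newline = re.findall(r'[\n]+(\w.*)', body)

# Alternative: anchor on line starts instead of consuming line breaks.
line_start = re.findall(r'^(\w.*)', body, flags=re.MULTILINE)

print(after_newline)
print(line_start)
```

Both versions skip the quoted line, but only the anchored one keeps the first line of the body.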


### Split into individual words
We then split each line on runs of non-word characters and use `expand=True` to put each word in its own column. Lines with fewer words are padded with missing values.

In [11]:
split_words = body_lines[0].str.split(r'\W+', expand=True)
split_words.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,0,1,2,3,4,5,6,7,8,9,...,27,28,29,30,31,32,33,34,35,36
category,match,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
sci.med,0,It,would,help,if,you,and,anyone,else,asking,for,...,,,,,,,,,,
sci.med,1,some,subject,could,ask,specific,questions,as,no,one,is,...,,,,,,,,,,
sci.med,2,in,a,textbook,chapter,covering,all,aspects,of,the,subject,...,,,,,,,,,,
sci.med,3,looking,for,a,comprehensive,review,ask,your,local,hospital,librarian,...,,,,,,,,,,
sci.med,4,Most,are,happy,to,help,with,a,request,of,this,...,,,,,,,,,,
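
Splitting on `\W+` has one side effect worth knowing: a line that ends in punctuation produces a trailing empty string. The sketch below shows this with `re.split` on a made-up line; pandas `str.split` with a regular expression should behave the same way. The later length filter removes these empty strings, so they do no harm in this notebook.

```python
import re

# Hypothetical line that ends with a period.
line = 'It would help if you (and anyone else) asked.'

# Split on runs of non-word characters, as in the cell above.
words = re.split(r'\W+', line)

# Alternative: collect the word runs directly, which never yields ''.
words_only = re.findall(r'\w+', line)

print(words)
print(words_only)
```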


### Stack words into a single column
Use the `stack` method to put all the words in a single column. This moves the column names into the innermost level of the index. We also convert the words to lowercase.

In [12]:
stacked_words = split_words.stack().str.lower()
stacked_words.head(30)

category  match    
sci.med   0      0              it
                 1           would
                 2            help
                 3              if
                 4             you
                 5             and
                 6          anyone
                 7            else
                 8          asking
                 9             for
                 10        medical
                 11    information
                 12             on
          1      0            some
                 1         subject
                 2           could
                 3             ask
                 4        specific
                 5       questions
                 6              as
                 7              no
                 8             one
                 9              is
                 10         likely
                 11             to
                 12           type
          2      0              in
                 1               a
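
Here is a minimal sketch of what `stack` does to a small wide table, assuming pandas is installed; the two rows are made up. Depending on the pandas version, `stack` may or may not drop missing values on its own, so the sketch calls `dropna` explicitly to get the same result either way.

```python
import pandas as pd

# Hypothetical wide table: one row per line, one column per word.
# The second line is shorter, so its last cell is missing.
wide = pd.DataFrame(
    [['It', 'would', 'help'], ['Most', 'are', None]],
    index=pd.Index(['sci.med', 'rec.autos'], name='category'),
)

# Stack the columns into one long Series, drop the padding,
# and lowercase every word.
stacked = wide.stack().dropna().str.lower()

print(stacked)
```

The column numbers 0, 1, 2 end up as the innermost index level, next to the category.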


## Remove words shorter than 7 characters
These shorter words won't give us as much information about the topic as the longer ones.

In [13]:
long_word = stacked_words[stacked_words.str.len() >= 7]
long_word.head(20)

category  match    
sci.med   0      10          medical
                 11      information
          1      1           subject
                 4          specific
                 5         questions
          2      2          textbook
                 3           chapter
                 4          covering
                 6           aspects
                 9           subject
          3      0           looking
                 3     comprehensive
                 8          hospital
                 9         librarian
          4      7           request
          5      0           briefly
                 4         condition
                 7          patients
                 10      significant
          6      0          residual
dtype: object
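
The same length rule can also be written as a regular expression, which fits the theme of this notebook. The sketch below is an alternative to the `str.len` filter, not the approach used above; it runs on one line taken from the first post.

```python
import re

line = 'looking for a comprehensive review, ask your local hospital librarian'

# \w{7,} matches runs of at least seven word characters;
# the \b anchors keep the match aligned to whole words.
long_words = re.findall(r'\b\w{7,}\b', line.lower())

print(long_words)
```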

### Groupby category and count the unique values
You can group by an index level and then count the values for each group.

In [14]:
category_counts = long_word.groupby('category').value_counts().reset_index()
category_counts.columns = ['category', 'word', 'count']
category_counts.head(10)

Unnamed: 0,category,word,count
0,misc.forsale,condition,17
1,misc.forsale,excellent,11
2,misc.forsale,interested,9
3,misc.forsale,shipping,9
4,misc.forsale,windows,9
5,misc.forsale,printer,8
6,misc.forsale,publish,8
7,misc.forsale,contact,7
8,misc.forsale,software,7
9,misc.forsale,compatible,6
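
A minimal sketch of the same grouping on made-up data, assuming pandas is installed. The names of the columns produced by `reset_index` vary between pandas versions, which is one reason to overwrite them explicitly as in the cell above.

```python
import pandas as pd

# Hypothetical words, indexed by their category.
words = pd.Series(
    ['shuttle', 'shuttle', 'telescope', 'federal', 'federal', 'federal'],
    index=pd.Index(['sci.space'] * 3 + ['talk.politics.guns'] * 3,
                   name='category'),
)

# Count each word within its category; counts are sorted
# in descending order inside every group.
counts = words.groupby('category').value_counts()

# Flatten the two index levels into ordinary columns and name them.
table = counts.reset_index()
table.columns = ['category', 'word', 'count']

print(table)
```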


### Select top 10 words per category

In [15]:
top10_words = category_counts.groupby('category').head(10)
top10_words.head(20)

Unnamed: 0,category,word,count
0,misc.forsale,condition,17
1,misc.forsale,excellent,11
2,misc.forsale,interested,9
3,misc.forsale,shipping,9
4,misc.forsale,windows,9
5,misc.forsale,printer,8
6,misc.forsale,publish,8
7,misc.forsale,contact,7
8,misc.forsale,software,7
9,misc.forsale,compatible,6
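
Taking the first rows of each group works here because the counts are already sorted within each category. As a cross-check that uses only the standard library, the sketch below computes the top words per category with `collections.Counter`. The (category, word) pairs are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical (category, word) pairs.
pairs = [
    ('sci.space', 'shuttle'), ('sci.space', 'telescope'),
    ('sci.space', 'shuttle'), ('sci.space', 'shuttle'),
    ('sci.space', 'telescope'), ('sci.space', 'orbiter'),
    ('talk.politics.guns', 'federal'), ('talk.politics.guns', 'federal'),
    ('talk.politics.guns', 'control'),
]

# One Counter per category.
counters = defaultdict(Counter)
for category, word in pairs:
    counters[category][word] += 1

# most_common(n) returns the n most frequent words with their counts.
top2 = {category: counter.most_common(2)
        for category, counter in counters.items()}

print(top2)
```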


### Fix the index
The index values are the old row positions from `category_counts`, which no longer make sense after selecting the top rows. Let's drop them.

In [16]:
top10_words = top10_words.reset_index(drop=True)
top10_words.head(20)

Unnamed: 0,category,word,count
0,misc.forsale,condition,17
1,misc.forsale,excellent,11
2,misc.forsale,interested,9
3,misc.forsale,shipping,9
4,misc.forsale,windows,9
5,misc.forsale,printer,8
6,misc.forsale,publish,8
7,misc.forsale,contact,7
8,misc.forsale,software,7
9,misc.forsale,compatible,6
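
The `drop=True` argument matters: without it, `reset_index` keeps the old labels as a new column. A minimal sketch on a made-up frame, assuming pandas is installed:

```python
import pandas as pd

# Hypothetical frame whose index has gaps after a selection.
df = pd.DataFrame({'word': ['a', 'b', 'c', 'd']}, index=[0, 1, 40, 41])

# Default: the old labels are preserved in a column named 'index'.
kept = df.reset_index()

# drop=True: the old labels are discarded entirely.
dropped = df.reset_index(drop=True)

print(list(kept.columns))
print(list(dropped.index))
```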


### Get unique categories for querying

In [17]:
top10_words['category'].unique()

array(['misc.forsale', 'rec.autos', 'rec.sport.baseball', 'sci.med',
       'sci.space', 'talk.politics.guns'], dtype=object)

### Choose a couple of categories

In [18]:
filt = top10_words['category'] == 'sci.space'
top10_words[filt]

Unnamed: 0,category,word,count
40,sci.space,telescope,27
41,sci.space,satellite,25
42,sci.space,national,24
43,sci.space,shuttle,22
44,sci.space,vehicle,18
45,sci.space,observatory,16
46,sci.space,because,15
47,sci.space,international,15
48,sci.space,spacecraft,14
49,sci.space,astronomical,13


In [19]:
filt = top10_words['category'] == 'talk.politics.guns'
top10_words[filt]

Unnamed: 0,category,word,count
50,talk.politics.guns,because,27
51,talk.politics.guns,federal,26
52,talk.politics.guns,believe,24
53,talk.politics.guns,against,23
54,talk.politics.guns,weapons,23
55,talk.politics.guns,without,23
56,talk.politics.guns,defense,22
57,talk.politics.guns,firearms,22
58,talk.politics.guns,control,21
59,talk.politics.guns,government,19
