In [1]:
import pandas as pd
import re

# First we'll import the data

In [5]:
df = pd.read_csv('allPostText_test.csv')

In [9]:
df['Text'].head()

0    \n« back to home\n  You can wear the flashiest...
1    \n« back to home\n  Exit polls show that Sebas...
2    \n« back to home\n  Is it going to be a five-y...
3    \n« back to home\n  The disgraceful thing is t...
4    \n« back to home\n  “I don’t recall any other ...
Name: Text, dtype: object

# Let's clean it a bit

In [12]:
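def clean(elem):
    # Drop the navigation link that precedes every post
    elem = elem.replace('\n« back to home\n  ', '')
    # Drop the previous/next links that follow it, then trim whitespace
    elem = elem.replace('« previous postnext post »', '').strip()
    return elem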
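
To see what `clean` does to a single value, here is a minimal sketch on two made-up strings shaped like the rows above. The sample strings are assumptions for illustration and are not taken from the CSV. A row that contains only navigation text comes back empty, which is probably why some rows show up blank after cleaning.

```python
def clean(elem):
    # Same function as in the cell above, repeated so this sketch runs on its own
    elem = elem.replace('\n« back to home\n  ', '')
    elem = elem.replace('« previous postnext post »', '').strip()
    return elem

# Hypothetical samples (not rows from the real file)
with_text = '\n« back to home\n  I mean, really\n« previous postnext post »'
nav_only = '\n« back to home\n  « previous postnext post »'

cleaned_text = clean(with_text)
cleaned_nav = clean(nav_only)

print(repr(cleaned_text))
print(repr(cleaned_nav))
```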

In [14]:
df['Text'].apply(clean).head(10)

0    You can wear the flashiest watch and keep your...
1    Exit polls show that Sebastian Kurz, 31, is ab...
2    Is it going to be a five-year electoral campai...
3    The disgraceful thing is that this man has bee...
4    “I don’t recall any other budget having given ...
5                                                     
6                                       I mean, really
7    Toni Bezzina, the member of parliament, entere...
8                                                     
9    David Agius today came forward officially as a...
Name: Text, dtype: object

In [17]:
df['Text'] = df['Text'].apply(clean)

In [19]:
df = df[df['Text'] != '']

# Let's find some text and numbers

First, a [simple search](https://docs.python.org/2/library/re.html) for a literal string.

In [23]:
text = df['Text'][10]

In [38]:
text

'Yesterday in the car I was listening to the lunchtime talk-show on the Nationalist Party’s radio station, hosted by Evelyn Vella Brincat, whose brother is the failed party leadership contender Frank Portelli. It was unbearable, but I felt I needed to suffer through it in the interest of journalism. David Agius, the party whip and contender for the post of deputy leader, was on with her.I finally switched off when Mrs Vella Brincat announced that it was David Agius’s birthday – how old is he, 10? – that those listening to the show should give him “the best birthday present ever by becoming members of the Nationalist Party, because I have known David for a long time and he has always been a party boy so he will want that more than anything” (translated from the Maltese).Then the intellectually challenged and free-loading Mr Agius, who should have been at his state-paid job at the Freeport at that time of day, interjected and said that what he wants more than anything is: “Li nara l-Part

In [30]:
re.search(r'car', text)

<_sre.SRE_Match object; span=(17, 20), match='car'>

In [36]:
re.search(r'car', text).group()

'car'

In [37]:
re.findall(r'car', text)

['car']
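
One thing to watch with `re.search`: it returns `None` when nothing matches, so calling `.group()` directly on the result fails in that case. `re.findall` returns an empty list instead. The sketch below uses a shortened copy of the opening words of the text above; the variable names are only for illustration.

```python
import re

# Shortened copy of the start of the post shown above
sample = 'Yesterday in the car I was listening to the lunchtime talk-show'

# Check the match object before using it
match = re.search(r'car', sample)
if match:
    found = match.group()   # the matched text
    span = match.span()     # (start, end) positions in the string

# A pattern that does not occur in the sample
missing = re.search(r'bicycle', sample)     # no match object is returned
all_hits = re.findall(r'bicycle', sample)   # findall gives an empty list

print(found, span, missing, all_hits)
```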

In [42]:
# Now let's match patterns

In [35]:
re.search(r'[0-9]', text).group()

'1'

In [39]:
re.findall(r'[0-9]', text)

['1', '0']

In [41]:
re.findall(r'[0-9]+', text)

['10']
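
The difference between the last two calls is the `+` quantifier: `[0-9]` matches exactly one digit, while `[0-9]+` matches a run of one or more digits. A small sketch on a made-up sentence (the sample string is an assumption, pieced together from phrases in the posts):

```python
import re

# Hypothetical sample with two numbers in it
sample = 'how old is he, 10? Sebastian Kurz, 31, is ahead'

single_digits = re.findall(r'[0-9]', sample)    # one digit per match
whole_numbers = re.findall(r'[0-9]+', sample)   # consecutive digits grouped

print(single_digits)
print(whole_numbers)
```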

Let's go to [https://regexr.com](https://regexr.com/) to practice. Try to match the names in the text above.

# Regexing

In [45]:
re.findall(r'[A-Z]\w+\s[A-Z]\w+', text)

['Nationalist Party',
 'Evelyn Vella',
 'Frank Portelli',
 'David Agius',
 'Mrs Vella',
 'David Agius',
 'Nationalist Party',
 'Mr Agius',
 'Partit Nazzjonalista',
 'Partit Laburista',
 'Nationalist Party',
 'Labour Party',
 'Nationalist Party',
 'David Agius',
 'Adrian Delia',
 'Nationalist Party']
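
The pattern matches exactly two capitalized words in a row, so a three-word name such as "Evelyn Vella Brincat" is cut to "Evelyn Vella". One possible variation, shown here only as a sketch and not as the pattern this notebook goes on to use, repeats the second part so that two or more capitalized words are captured together. The sample sentence is a made-up condensation of the post above.

```python
import re

# Hypothetical sample condensed from the post above
sample = ('hosted by Evelyn Vella Brincat, whose brother is Frank Portelli. '
          'Mrs Vella Brincat announced it.')

# Pattern from the notebook: exactly two capitalized words
two_words = re.findall(r'[A-Z]\w+\s[A-Z]\w+', sample)

# Variation: one capitalized word followed by one or more further ones
two_or_more = re.findall(r'[A-Z]\w+(?:\s[A-Z]\w+)+', sample)

print(two_words)
print(two_or_more)
```

Both patterns still pick up non-names such as party names or titles, so the results need a manual look either way.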

# Let's make a function that includes a regex

In [46]:
def regexing(elem):
    # Return every pair of adjacent capitalized words found in elem
    lst = re.findall(r'[A-Z]\w+\s[A-Z]\w+', elem)
    return lst

In [50]:
df['Names'] = df['Text'].apply(regexing)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.

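The message above is pandas' SettingWithCopyWarning. It most likely appears because `df` was produced by filtering (`df[df['Text'] != '']`), and pandas cannot tell whether the new column should also end up in the original data. A common way to avoid it, sketched here on a small made-up DataFrame rather than the real file, is to take an explicit `.copy()` right after filtering. Newer pandas versions may handle this differently, so treat it as an assumption about the version used here.

```python
import re
import pandas as pd

# Hypothetical stand-in for the real data
df_demo = pd.DataFrame({'Text': ['David Agius spoke first', '', 'I mean, really']})

# .copy() makes the filtered result an independent DataFrame
df_demo = df_demo[df_demo['Text'] != ''].copy()

# Adding a column to the independent copy is unambiguous
df_demo['Names'] = df_demo['Text'].apply(
    lambda t: re.findall(r'[A-Z]\w+\s[A-Z]\w+', t)
)

print(df_demo)
```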

In [53]:
list(df['Names'])[:5]

[[],
 ['Sebastian Kurz',
  'Freedom Party',
  'Social Democrats',
  'Sebastian Kurz',
  'Christian Kern',
  'Social Democrats',
  'Christian Strache',
  'Sebastian Kurz'],
 ['Labour Party',
  'Prime Minister',
  'Nationalist Party',
  'Rank Xerox',
  'Kristy Debono',
  'Hermann Schiavone',
  'Nationalist Party'],
 ['Nationalist Party', 'The Nationalist', 'European Union'],
 []]

In [55]:
lst = list(df['Names'])

# Now we want to see who is mentioned the most

In [58]:
# First we need to delete the empty lists:
lst = [x for x in lst if x != []]

In [59]:
lst

[['Sebastian Kurz',
  'Freedom Party',
  'Social Democrats',
  'Sebastian Kurz',
  'Christian Kern',
  'Social Democrats',
  'Christian Strache',
  'Sebastian Kurz'],
 ['Labour Party',
  'Prime Minister',
  'Nationalist Party',
  'Rank Xerox',
  'Kristy Debono',
  'Hermann Schiavone',
  'Nationalist Party'],
 ['Nationalist Party', 'The Nationalist', 'European Union'],
 ['Toni Bezzina',
  'Nationalist Party',
  'MP Robert',
  'Edwin Vassallo',
  'David Agius',
  'Mr Agius',
  'Toni Bezzina',
  'Robert Arrigo',
  'Nationalist Party'],
 ['David Agius',
  'Nationalist Party',
  'Edwin Vassallo',
  'Nationalist Party',
  'Chris Said',
  'Adrian Delia',
  'Dr Said',
  'Dr Said',
  'Mr Vassallo',
  'Dr Said',
  'David Agius',
  'Dr Said',
  'Mr Agius',
  'Dr Delia',
  'Dr Said',
  'Dr Delia',
  'Dr Said',
  'Mr Agius',
  'Dr Delia',
  'Nationalist Party',
  'Robert Arrigo',
  'Clyde Puli',
  'When David',
  'Chris Said',
  'Mr Puli',
  'Dr Delia',
  'Edwin Vassallo',
  'Dr Delia',
  'Mr Agius

Flatten the [list of lists](https://stackoverflow.com/questions/952914/making-a-flat-list-out-of-list-of-lists-in-python) into a single list.

In [60]:
flat_list = [item for sublist in lst for item in sublist]
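
The nested comprehension reads as "for each sublist, for each item in that sublist, keep the item". The standard library's `itertools.chain.from_iterable` does the same job. The sketch below runs both on a small made-up list; it also shows that empty sublists contribute nothing, so filtering them out beforehand does not change the result.

```python
from itertools import chain

# Hypothetical nested list, including an empty sublist
nested = [['Sebastian Kurz', 'Freedom Party'], [], ['Labour Party']]

# Same comprehension as in the cell above
flat_comprehension = [item for sublist in nested for item in sublist]

# Equivalent standard-library approach
flat_chain = list(chain.from_iterable(nested))

print(flat_comprehension)
print(flat_chain)
```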

# Now let's count

In [66]:
pd.DataFrame(flat_list)[0].value_counts().head(10)

Nationalist Party    100
Adrian Delia          45
Prime Minister        25
Labour Party          19
Mrs Delia             17
Jean Pierre           16
Dr Delia              13
Mr Cutajar            11
Anton Rea             11
Rebecca Dimech        11
Name: 0, dtype: int64
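
The same count can be done without a DataFrame using `collections.Counter` from the standard library. This is only an alternative sketch on a short made-up list, not the method used in this notebook. Note that different spellings of the same person, such as "Adrian Delia" and "Dr Delia", are counted as separate entries in both approaches.

```python
from collections import Counter

# Hypothetical short list of extracted names
names = ['Nationalist Party', 'Adrian Delia', 'Nationalist Party',
         'Dr Delia', 'Nationalist Party', 'Adrian Delia']

counts = Counter(names)

# The two most frequent entries as (name, count) pairs
top_two = counts.most_common(2)

print(top_two)
```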

In [68]:
df_names = pd.DataFrame(flat_list)

In [69]:
df_names.to_csv('names.csv')
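
As written, the saved file gets an unnamed index column and a data column called `0`. If a cleaner file is wanted, one option is to name the column and pass `index=False`. The column name `Name` and the sample values below are assumptions for illustration. Calling `to_csv` without a path returns the CSV text, which makes it easy to inspect.

```python
import pandas as pd

# Hypothetical short list of extracted names
names = ['Nationalist Party', 'Adrian Delia']

# Name the column explicitly instead of relying on the default 0
df_demo = pd.DataFrame({'Name': names})

# With no path given, to_csv returns the CSV content as a string
csv_text = df_demo.to_csv(index=False)
csv_lines = csv_text.splitlines()

print(csv_lines)
```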