![UKDS Logo](images/UKDS_Logos_Col_Grey_300dpi.png)

# Foot and Mouth: Pre-processing (retrieval, segmentation, etc.)

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Retrieval" data-toc-modified-id="Retrieval-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Retrieval</a></span><ul class="toc-item"><li><span><a href="#Check-the-.csv-exists-and-is-where-you-think-it-is" data-toc-modified-id="Check-the-.csv-exists-and-is-where-you-think-it-is-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Check the .csv exists and is where you think it is</a></span></li></ul></li><li><span><a href="#Get-the-data-in-order" data-toc-modified-id="Get-the-data-in-order-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Get the data in order</a></span><ul class="toc-item"><li><span><a href="#Rename-columns" data-toc-modified-id="Rename-columns-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Rename columns</a></span></li><li><span><a href="#Adding-Occupation-Column" data-toc-modified-id="Adding-Occupation-Column-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Adding Occupation Column</a></span></li><li><span><a href="#Preparing-to-split-columns" data-toc-modified-id="Preparing-to-split-columns-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Preparing to split columns</a></span><ul class="toc-item"><li><span><a href="#Pt1.-Where-to-split-the-Dataframe-Rows?" data-toc-modified-id="Pt1.-Where-to-split-the-Dataframe-Rows?-2.3.1"><span class="toc-item-num">2.3.1&nbsp;&nbsp;</span>Pt1. Where to split the Dataframe Rows?</a></span></li><li><span><a href="#Pt.-2---ACTUALLY-splitting-the-dataframe-into-2-(3!)-parts" data-toc-modified-id="Pt.-2---ACTUALLY-splitting-the-dataframe-into-2-(3!)-parts-2.3.2"><span class="toc-item-num">2.3.2&nbsp;&nbsp;</span>Pt. 2 - ACTUALLY splitting the dataframe into 2 (3!) 
parts</a></span></li></ul></li></ul></li><li><span><a href="#Regular-conditional-filtering-vs-Boolean-masking" data-toc-modified-id="Regular-conditional-filtering-vs-Boolean-masking-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Regular conditional filtering vs Boolean masking</a></span><ul class="toc-item"><li><span><a href="#Tilde-operator-+-boolean-masking-for-counting-Occupation-instances" data-toc-modified-id="Tilde-operator-+-boolean-masking-for-counting-Occupation-instances-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Tilde operator + boolean masking for counting Occupation instances</a></span></li></ul></li><li><span><a href="#Pre-Processing-Summary" data-toc-modified-id="Pre-Processing-Summary-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Pre-Processing Summary</a></span></li><li><span><a href="#Resources" data-toc-modified-id="Resources-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Resources</a></span></li></ul></div>

## Retrieval


The first step in text-mining, or any form of data-mining, is retrieving a data set to work with. Within text-mining, or any language analysis context, one data set is usually referred to as 'a corpus' while multiple data sets are referred to as 'corpora'. 'Corpus' is a Latin word and therefore has a funny plural.

For text-mining, a corpus can be:
- a set of tweets, 
- the full text of an 18th century novel,
- the contents of a page in the dictionary, 
- minutes of local council meetings, 
- random gibberish letters and numbers, or
- just about anything else in text format. 

For the purposes of this notebook, our corpus is a .csv file that we created in a different notebook (importing_multiple_rtf.ipynb). 

### Check the .csv exists and is where you think it is

In [1]:
# It is good practice to always start by importing the modules and packages you will need. 

import os                         # os is a module for navigating your machine (e.g., file directories).
import pandas as pd

# List all of the files in the "data" folder that is provided to you
print("")
for file in os.listdir("./data/foot_mouth"):
   print("2. One of the files in ./data is...", file)
print("")



2. One of the files in ./data is... foot_mouth_original.xls
2. One of the files in ./data is... text.csv
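
Listing the folder shows what is there, but you can also test for one specific file directly. The sketch below is one possible way to do that with `pathlib`; the helper name `csv_is_present` is made up for illustration, and the path is simply the one this notebook reads from.

```python
from pathlib import Path


def csv_is_present(path):
    """Return True if `path` is an existing file whose name ends in .csv."""
    p = Path(path)
    return p.is_file() and p.suffix.lower() == ".csv"


# Hypothetical usage with the path used later in this notebook
csv_path = "../code/data/foot_mouth/text.csv"
if csv_is_present(csv_path):
    print("Found", csv_path)
else:
    print("Could not find", csv_path, "- check which folder you are running from")
```

If the check fails, the most common cause is that the notebook is running from a different working directory than you expect.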



In [2]:
foot_mouth_df = pd.read_csv('../code/data/foot_mouth/text.csv')     # loads the file into a pandas DataFrame
print(foot_mouth_df[:10])                                           # prints the 1st 10 rows to get a sense of its contents

   Unnamed: 0                0  \
0           0  5407diary02.rtf   
1           1  5407diary03.rtf   
2           2  5407diary07.rtf   
3           3  5407diary08.rtf   
4           4  5407diary09.rtf   
5           5  5407diary10.rtf   
6           6  5407diary13.rtf   
7           7  5407diary14.rtf   
8           8  5407diary15.rtf   
9           9  5407diary16.rtf   

                                                   1  
0  \n\nInformation about diarist\nDate of birth: ...  
1  Information about diarist\nDate of birth: 1966...  
2  \n\nInformation about diarist\nDate of birth: ...  
3  Information about diarist\nDate of birth: 1963...  
4  Information about diarist\nDate of birth: 1981...  
5  Information about diarist\nDate of birth: 1937...  
6  Information about diarist\nDate of birth: 1947...  
7  \nInformation about diarist\nDate of birth: 19...  
8  Information about diarist\nDate of birth: 1949...  
9  \nInformation about diarist\nDate of birth: 19...  


_______________________________________________________________________________________________________________________________
Right. We have three columns: one is a number, one is the name of the original .rtf file that the text came from, and one is the text. 

Looks a bit messy. 

Before we go further, it helps to know what kind of variable foot_mouth_df is. Run/Shift+Enter the next code block to find out!

In [None]:
type(foot_mouth_df)

_______________________________________________________________________________________________________________________________
This tells us that 'foot_mouth_df' is a pandas DataFrame, which is a table with labelled rows and columns. That is not a bad thing. 

Congratulations! We are done with the retrieval portion of this process. The rest won't be quite so straightforward because next up... Accessing! This allows us to get individual rows, columns and/or cells to inspect, change, label, or split them. 

## Get the data in order

### Rename columns

Obviously, we can access any column, row or cell without using named labels. But it might be easier to give some of the things named labels. This makes more sense with columns - especially if we are going to split the columns into lots of other columns and it will be hard to keep track of what the numbered columns refer to. 

For now, let's keep working with the foot_mouth_df rather than any new variables you might have created. 

In [3]:
foot_mouth_df.head()        # name_of_dataframe.head() is an easy way to see the column names and 1st 5 rows

Unnamed: 0.1,Unnamed: 0,0,1
0,0,5407diary02.rtf,\n\nInformation about diarist\nDate of birth: ...
1,1,5407diary03.rtf,Information about diarist\nDate of birth: 1966...
2,2,5407diary07.rtf,\n\nInformation about diarist\nDate of birth: ...
3,3,5407diary08.rtf,Information about diarist\nDate of birth: 1963...
4,4,5407diary09.rtf,Information about diarist\nDate of birth: 1981...


We can see that our columns have pretty stupid names. Let's change that. 

In [3]:
# This code defines the column names. It will overwrite whatever column names are already there,
# including that weird "Unnamed: 0" name that the first column had.
foot_mouth_df.columns = ["Number", "Filename", "everything_else"]

Let's check and see if it worked. 

In [5]:
foot_mouth_df.head()        # name_of_dataframe.head() is an easy way to see the column names and 1st 5 rows

Unnamed: 0,Number,Filename,everything_else
0,0,5407diary02.rtf,\n\nInformation about diarist\nDate of birth: ...
1,1,5407diary03.rtf,Information about diarist\nDate of birth: 1966...
2,2,5407diary07.rtf,\n\nInformation about diarist\nDate of birth: ...
3,3,5407diary08.rtf,Information about diarist\nDate of birth: 1963...
4,4,5407diary09.rtf,Information about diarist\nDate of birth: 1981...


Great! 
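
Assigning a list to `.columns` relies on the names being in exactly the right order. As an alternative sketch, `rename()` maps each old name to a new one explicitly and leaves the original DataFrame untouched. The small `demo_df` below is a made-up stand-in for `foot_mouth_df`, not the real data.

```python
import pandas as pd

# Made-up stand-in for foot_mouth_df as it looks straight after read_csv
demo_df = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "0": ["5407diary02.rtf", "5407diary03.rtf"],
    "1": ["Information about diarist...", "Information about diarist..."],
})

# rename() returns a new DataFrame; each old name is paired with its new name
renamed_df = demo_df.rename(columns={
    "Unnamed: 0": "Number",
    "0": "Filename",
    "1": "everything_else",
})

print(list(renamed_df.columns))
```

Either approach works here; `rename()` is mostly useful when you only want to change some of the names.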

### Adding Occupation Column

First off we will create a new Pandas Dataframe with an occupation column in it. This is what we will be doing the rest of the pre-processing on!

In [23]:
oc_foot_mouth = foot_mouth_df.assign(Occupation = foot_mouth_df['everything_else'].str.extract(r'(\w+\s+\d{1,2})'))
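
To see what that pattern is doing, here is a small sketch on abridged strings modelled on the file headers shown above (they are illustrative, not the full texts). The pattern `(\w+\s+\d{1,2})` captures the first "word, whitespace, one or two digits" sequence it finds. The second pattern, which anchors on the `Occupation:` label, is an assumption of mine shown only for comparison.

```python
import pandas as pd

# Abridged, illustrative strings modelled on the diary, interview and group headers
samples = pd.Series([
    "Information about diarist\nDate of birth: 1966\nGender: F\nOccupation: Group 6\nGeographic region: North Cumbria",
    "\nDate of Interview: 26/02/02\n\nInformation about Panel Member\nDate of birth: 1981\nGender: F\nOccupation: Group 5",
    "\n\nGroup Discussion Panel Members, Group 6",
])

# The pattern used in this notebook: first "word + whitespace + 1-2 digits" in each text
generic = samples.str.extract(r'(\w+\s+\d{1,2})', expand=False)

# A stricter alternative (assumption): only accept a group that follows "Occupation:"
targeted = samples.str.extract(r'Occupation:\s*(Group\s+\d{1,2})', expand=False)

print(generic.tolist())
print(targeted.tolist())
```

The generic pattern finds a group in all three strings, while the stricter one returns a missing value for the group discussion header because it has no `Occupation:` line. That trade-off is worth keeping in mind: the generic pattern covers more files but would also pick up any earlier "word number" sequence if one appeared in the text.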

Now let's double-check that this didn't affect the original foot_mouth dataframe! (I will be using the original dataframe for processing)

In [10]:
foot_mouth_df.head()

Unnamed: 0,Number,Filename,everything_else
0,0,5407diary02.rtf,\n\nInformation about diarist\nDate of birth: ...
1,1,5407diary03.rtf,Information about diarist\nDate of birth: 1966...
2,2,5407diary07.rtf,\n\nInformation about diarist\nDate of birth: ...
3,3,5407diary08.rtf,Information about diarist\nDate of birth: 1963...
4,4,5407diary09.rtf,Information about diarist\nDate of birth: 1981...


Perfect! Now how about we work on splitting that "everything_else" column in our Occupation dataframe into more useful things.

### Preparing to split columns

#### Pt1. Where to split the Dataframe Rows?

Before we get to actually splitting the "everything else" column, let's just triple check that it is consistent all the way down. 

This is important because usually we split according to a specific position (like "after the 10th character") or after a specific delimiter (like "after every comma"). We don't actually know that we have either a position or delimiter to use that applies to all of the files.  


We could look at *every* cell in the column of interest, or we could look at the first few and the last few and jump to some conclusions. Since we already know how to find the head, let's compare that head to the tail. 


In [None]:
print(foot_mouth_df.head())
print(" ")
print(foot_mouth_df.tail())

Hmmmm. They are not all consistent. But there is a lot of consistency! The diary files seem to start with "Information about diarist" while the interview files start with "date of interview". 

It seems like we can't split the columns until we split this data frame into 2, one for diary files and one for interview files. How would you go about doing that? Steps to consider might include:
* find the last row of the diary entries and the first row of the interview entries (using access rows and/or access cells)
* save a new "diary" variable that contains all of the columns for all of the diary rows
* save a new "interview" variable that contains all of the columns for all of the interview rows

In [None]:
# Finding the rows where the DataFrame should be split

foot_mouth_df.loc[:50] # Quick inspection to see where the splits should be put


Okay! So it looks like there is an extra type of file for 'Group Discussions' that we will also need to split off!

But that's fine, it won't take much to split it. It's just one extra line of code!

#### Pt. 2 - ACTUALLY splitting the dataframe into 2 (3!) parts

In [32]:
#Setting variables that split up the DataFrame

diary_file = oc_foot_mouth.loc[:39]     # Saving variable for all diary rows
group_file = oc_foot_mouth.loc[40:45]   # Saving variables for all group interview rows
interview_file = oc_foot_mouth.loc[46:] # Saving variable for all interview rows
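
The row numbers above were found by eye, and note that `.loc` slices include the end label, so `.loc[:39]` holds 40 rows. As a hedged alternative, the split can be derived from the filenames instead, assuming every diary, group and interview file contains `diary`, `fg` or `int` in its name, as the filenames shown in this notebook do. The `demo_df` below is a tiny made-up stand-in.

```python
import pandas as pd

# Tiny stand-in using filenames that follow the pattern seen in this notebook
demo_df = pd.DataFrame({
    "Filename": [
        "5407diary02.rtf",
        "5407diary03.rtf",
        "5407fg01.rtf",
        "5407int02.rtf",
        "5407int03.rtf",
    ],
})

# One boolean mask per file type, based on a substring of the filename
is_diary = demo_df["Filename"].str.contains("diary")
is_group = demo_df["Filename"].str.contains("fg")
is_interview = demo_df["Filename"].str.contains("int")

diary_rows = demo_df[is_diary]
group_rows = demo_df[is_group]
interview_rows = demo_df[is_interview]

print(len(diary_rows), len(group_rows), len(interview_rows))
```

This version keeps working if the files are loaded in a different order, whereas fixed row numbers would need to be checked again.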

In [13]:
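# Now quickly checking the type!
# Note: a notebook cell only displays the value of its last expression,
# so only one type is shown in the output below.

type(diary_file)
type(group_file)
type(interview_file)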

pandas.core.frame.DataFrame

Okay and now to check the contents of these split dataframes!

In [14]:
diary_file[:20]

Unnamed: 0,Number,Filename,everything_else,Occupation
0,0,5407diary02.rtf,\n\nInformation about diarist\nDate of birth: ...,Group 6
1,1,5407diary03.rtf,Information about diarist\nDate of birth: 1966...,Group 6
2,2,5407diary07.rtf,\n\nInformation about diarist\nDate of birth: ...,Group 6
3,3,5407diary08.rtf,Information about diarist\nDate of birth: 1963...,Group 6
4,4,5407diary09.rtf,Information about diarist\nDate of birth: 1981...,Group 5
5,5,5407diary10.rtf,Information about diarist\nDate of birth: 1937...,Group 5
6,6,5407diary13.rtf,Information about diarist\nDate of birth: 1947...,Group 5
7,7,5407diary14.rtf,\nInformation about diarist\nDate of birth: 19...,Group 5
8,8,5407diary15.rtf,Information about diarist\nDate of birth: 1949...,Group 5
9,9,5407diary16.rtf,\nInformation about diarist\nDate of birth: 19...,Group 5


In [16]:
group_file[:20]

Unnamed: 0,Number,Filename,everything_else,Occupation
40,40,5407fg01.rtf,\nGroups Discussion with Members of Farmers F...,Group 1
41,41,5407fg02.rtf,Groups Discussion with Members of Small Busine...,Group 2
42,42,5407fg03.rtf,\n\nGroups Discussion with Members of Agricul...,Group 3
43,43,5407fg04.rtf,\nNO AUDIO RECORDING\n\nGroups Discussion with...,Group 4
44,44,5407fg05.rtf,\n\nGroups Discussion with Community Group of ...,Group 5


In [17]:
interview_file[:50]

Unnamed: 0,Number,Filename,everything_else,Occupation
46,46,5407int02.rtf,\nDate of Interview: 14/03/02\n\nInformation a...,Group 6
47,47,5407int03.rtf,\nDate of Interview: 08/03/02\n\nInformation a...,Group 6
48,48,5407int07.rtf,\nDate of Interview: 14/03/02\n\nInformation a...,Group 6
49,49,5407int08.rtf,\nDate of Interview: 06/03/02\n\nInformation a...,Group 6
50,50,5407int09.rtf,\nDate of Interview: 26/02/02\n\nInformation a...,Group 5
51,51,5407int10.rtf,\nDate of Interview: 08/03/02\n\nInformation a...,Group 5
52,52,5407int13.rtf,\nDate of Interview: 19/03/02\n\nInformation a...,Group 5
53,53,5407int14.rtf,\nDate of Interview: 25/02/02\n\nInformation a...,Group 5
54,54,5407int15.rtf,\nDate of Interview: 25/02/02\n\nInformation a...,Group 5
55,55,5407int16.rtf,\nDate of Interview: 07/03/02\n\nInformation a...,Group 5


Looks good! Now let's move on to some data exploration:

## Regular conditional filtering vs Boolean masking

First we will be using Regular conditional filtering and Boolean Masking to see if there are any missing values in the Occupation column.

These methods are merely another way to filter your dataframe and inspect values that interest you.

Neither method is superior. Both do the same thing: the first builds the True/False condition inline, while the second stores it in a variable (the "mask") so it can be inspected and reused. 

Let's say we want to inspect the NaN values in the Occupation column...

In [22]:
# Regular conditional filtering often looks like this...

oc_foot_mouth[oc_foot_mouth['Occupation'].isna()]

Unnamed: 0,Number,Filename,everything_else,Occupation


Okay, there are none - perfect! But we can also double-check this using Boolean masking!

In [25]:
# But, we can do the same thing by creating a boolean mask...

missing = oc_foot_mouth['Occupation'].isna()

# If we inspect this mask, we can see that it has returned a series of True/False values based on the condition
missing

0     False
1     False
2     False
3     False
4     False
      ...  
82    False
83    False
84    False
85    False
86    False
Name: Occupation, Length: 87, dtype: bool

In [58]:
type(missing)

pandas.core.series.Series

In [26]:
# Now we can filter our dataframe with this boolean series
# This will return the rows in which the condition = True

oc_foot_mouth[missing]

Unnamed: 0,Number,Filename,everything_else,Occupation


In [28]:
len(oc_foot_mouth[missing])

0
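
If all you need is the number of missing values, a shorter sketch is to sum the mask itself, since each `True` counts as 1. The `demo` series below is made up and deliberately contains two missing values so that the result is not zero.

```python
import pandas as pd

# Made-up Occupation column with two missing values
demo = pd.Series(["Group 1", None, "Group 6", None], name="Occupation")

mask = demo.isna()          # True where the value is missing
n_missing = mask.sum()      # True counts as 1, False as 0
has_missing = mask.any()    # True if at least one value is missing

print(n_missing, has_missing)
```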

In [25]:
group_file

Unnamed: 0,Number,Filename,everything_else,Occupation
40,40,5407fg01.rtf,\nGroups Discussion with Members of Farmers F...,Group 1
41,41,5407fg02.rtf,Groups Discussion with Members of Small Busine...,Group 2
42,42,5407fg03.rtf,\n\nGroups Discussion with Members of Agricul...,Group 3
43,43,5407fg04.rtf,\nNO AUDIO RECORDING\n\nGroups Discussion with...,Group 4
44,44,5407fg05.rtf,\n\nGroups Discussion with Community Group of ...,Group 5


Yup! Still nothing! Ok, so now let's move on to seeing how many files there are for each occupation!

### Tilde operator + boolean masking for counting Occupation instances

Another useful filtering tip is learning how to use the '~' tilde operator.
This operator is used to negate the Boolean values in the dataframe, i.e., True becomes False and False becomes True!

This can be quite useful. For instance, let's say we wanted to filter our dataframe by Occupation, but we want to look at every group apart from Group 6.

In [5]:
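# Our instinct might be to do something like this...
# (base_foot_mouth is not created in this notebook; oc_foot_mouth can be used in its place,
#  although it lacks the Dates and Gender columns shown in the output below)

groups = ['Group 1', 'Group 2', 'Group 3', 'Group 4', 'Group 5']

base_foot_mouth[base_foot_mouth['Occupation'].isin(groups)]


# The isin() method keeps the rows whose value appears in a list,
# so it is a way of checking several values at once

# However, this is a bit laborious...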

Unnamed: 0,Number,Filename,everything_else,Dates,Gender,Occupation
4,4,5407diary09.rtf,Information about diarist\nDate of birth: 1981...,25th February 2002,F,Group 5
5,5,5407diary10.rtf,Information about diarist\nDate of birth: 1937...,11th March 2002,M,Group 5
6,6,5407diary13.rtf,Information about diarist\nDate of birth: 1947...,18th March 2002,M,Group 5
7,7,5407diary14.rtf,\nInformation about diarist\nDate of birth: 19...,,F,Group 5
8,8,5407diary15.rtf,Information about diarist\nDate of birth: 1949...,,F,Group 5
...,...,...,...,...,...,...
82,82,5407int49.rtf,\nDate of Interview: 22/01/02\n\nInformation a...,22/01/02,F,Group 1
83,83,5407int52.rtf,\nDate of Interview: 08/01/02\n\nInformation a...,08/01/02,M,Group 1
84,84,5407int53.rtf,\nDate of Interview: 21/01/02\n\nInformation a...,21/01/02,M,Group 1
85,85,5407int54.rtf,\nDate of Interview: 17/01/02\n\nInformation a...,17/01/02,M,Group 1


Instead, I will just use the == operator on one group at a time and see if that is easier!

In [38]:
oc_foot_mouth[oc_foot_mouth['Occupation'] == 'Group 6']

Unnamed: 0,Number,Filename,everything_else,Occupation
0,0,5407diary02.rtf,"\n\nInformation about diarist\nDate of birth: 1975\nGender: M\nOccupation: Group 6\nGeographic region: North Cumbria\n\n\nDiary 1 \nThursday Meeting @ N Lakes\nFriday TB testing on restocking farm. Usual chat and DEFRA comments\nThe meeting (research panel gp 6) at the North Lakes was interesting. It surprises me sometimes how people (myself included) never seem to tire of the same stories and complaints over how the crisis was handled. Some of the episodes recounted must have been told dozens of times over the last year but whoever says it always seems just as keen to say it again – Perhaps a reflection of how deeply people feel about the events of the last year. Having said that, most of the resentments and rants that I hear on daily farm visits are focused fairly and squarely at DEFRA and not FMD virus. Farmers seem far more upset at the constriction put on them by DEFRA than they do by the loss of stock now, although I know and saw how utterly devastated most were when ...",Group 6
1,1,5407diary03.rtf,"Information about diarist\nDate of birth: 1966\nGender: F\nOccupation: Group 6\nGeographic region: North Cumbria\n\n\n\nDiary 1\nMonday was the usual long hard grind. I accept that I have to put in 10 – 12 hours and I don’t mind doing the work because it’s not physically or mentally taxing but I do hate not having a lunch break, just that little bit of selfish time to site, have a cigarette, take the dogs down the river, see the horses…whatever. I do resent that fact that W (one of the bosses) almost always gets a lunch hour. B (the other boss) has gone up tremendously in my opinion for the way that he gets on with the work. He starts early, finishes late, hates DERFA paperwork and rarely complains. It is definitely grinding them down because they work like that at least 4 days a week. It has been a huge advantage this last year being part-time at work. My days off obviously aren’t my own as they used to be, but I do get away from the phone and the demands of clients. Some of our c...",Group 6
2,2,5407diary07.rtf,"\n\nInformation about diarist\nDate of birth: 1964\nGender: F\nOccupation: Group 6\nGeographic region: North Cumbria\n\n\n\nWeek beginning 4th March 02\n\nMonday 4th March\nWe decided we now need more staff, a new vet and a part time receptionist, this could take us back up to our previous staffing level pre FM, bar a vet. But this was probably going to be all the recruitments we would make this year. It’s a good sign As Things begin to get back to normal!! Work is increasing quite a lot now most of our farmers have restocked, although a lot of them and us are still concerned about the future.\nTuesday 5th March\nA difficult day today with licences, two of our farmers needed Sole Occupancy Authentitys (SOA) and there were queries with their land, unless some of their fields could be redefined, they were worried their stock would suffer. During the FM these worries have shown how much they care about their stock and find it very frustrating when they don’t understand why we can’...",Group 6
3,3,5407diary08.rtf,"Information about diarist\nDate of birth: 1963\nGender: M\nOccupation: Group 6\nGeographic region: North Cumbria\n\n\nSaturday 9th March 2002\nAn old African proverb states, ""The best time to plant a tree was 20 years ago. The next best time is now.""\nI should have started this diary over a year ago to keep track of changes in the unrolling of the FMD epidemic and my feelings towards it. Today is probably a good day to start the diary as I was about to sit down after a really bad week to write some comments for the lessons learned inquiry and thought I should check the web site for an update. I found out to my considerable annoyance that the inquiry was coming to Cumbria to meet with DEFRA and hold the open meeting on Tuesday night, and if you wanted a ticket to attend, then you had to apply by a week ago. The overall impression is that the govt do not want to learn lessons. I was looking after kids as [wife] was on counselling course and I was on call Sat morn. The vets were busy ...",Group 6
45,45,5407fg06.rtf,"\n\nGroup Discussion Panel Members, Group 6 – Animal and Human Health related occupations – 28/02/02\n\n7 Panel Members were present and they are identified thus:\nL S T G M D Ly\nWomen L T G M Ly\nMen S D\n\n( )\tBracket with number in indicates missing word/words\n[ ]\tSquare bracket indicates action\n\nPoor recording constant background noise. Was difficult to identify certain speakers as they had soft voices or they were sitting away from mike.\n\nSo if anybody can remember back to February when you first heard that Foot & Mouth was diagnosed and the thoughts that were going through your head at that time.\n\nL I was so shocked it was just a big boggy of a disease that was never gonna happen in Britain.\n\nS Something that you’d had [over talking] occasional 20 minute lecture on, don’t know what it looks like but I not likely to see it see it. \n\nL If ever went abroad I mightn’t have known about it.\n\nS Even when it came, I remember, sounds really a thing that eve...",Group 6
46,46,5407int02.rtf,"\nDate of Interview: 14/03/02\n\nInformation about Panel Member\nDate of birth: 1975\nGender: M\nOccupation: Group 6\nGeographic region: North Cumbria\n\nThere is sensitive information on pages 35 and 36 of this transcript, which should not be used in any way that could identify the respondent\n\nInterview was held at the respondent’s home, a large terraced house near the centre of town. We sat at the kitchen table. Respondent seemed relaxed, happy to talk . I had rearranged the time of this interview so that I could attend the Anderson Inquiry meeting in Carlisle 2 days earlier. He had asked me about the meeting and I told him that people had noted inconsistencies in the way vets handled “signing off “procedures. He was explaining possible reasons for this when we switched on the tape:\n\n‘Cos you’re not, best not break the rules but sort of bend them to, say well, “I know you I’ve known you for the last three years and my boss has known you for the last twenty years, I know your ...",Group 6
47,47,5407int03.rtf,"\nDate of Interview: 08/03/02\n\nInformation about Panel Member\nDate of birth: 1966\nGender: F\nOccupation: Group 6\nGeographic region: North Cumbria\n\nTape starts part way through a conversation about the start of FMD\n\n…yeah, its like other people say it’s a long, long away is Essex you know its somewhere where I’ve never been. And you think well its come from a pig farm, so until you found out where the pig had come from you didn’t really know where it was gonna have the focus you know.\n\nDo pigs carry more FMD virus?\n\nPigs excrete more virus yeah, they excrete more virus per breath, they are also quite susceptible to aerosol infection as are cattle, because they have a big lung volume. Sheep are not as susceptible because they don’t take in as much air with every breath. \n\nBut then when it started to get up north and we had Longtown. I mean, I come from a village that’s about 15 miles north of Longtown, so I knew about Longtown Auction Mart. I used to take sheep the...",Group 6
48,48,5407int07.rtf,"\nDate of Interview: 14/03/02\n\nInformation about Panel Member\nDate of birth: 1964\nGender: F\nOccupation: Group 6\nGeographic region: North Cumbria\n\nI arrived by taxi as the family car was unavailable and walked through the reception area. L held open the door to her kitchen, ""come on in"" where her golden retriever made a fuss of me. ""We'll go through to the dining room"" - so we settled in one of the rooms looking over the A road and beyond to rolling pastures. L's 15 year old daughter popped in , ""What's for tea then"" and the phone which rang in the background must have been picked up by someone in a different part of the house. We settled at the round table and started the tape.\n\nWhere I usually start is by asking you to tell me a little bit about you, your background and your family.\n\nWe bought the practice about twelve years ago now, with my hubbie the vet, and one daughter, and the dog. Only one dog is ours, the other is somebody else’s. Me and G met when we both...",Group 6
49,49,5407int08.rtf,"\nDate of Interview: 06/03/02\n\nInformation about Panel Member\nDate of birth: 1963\nGender: M\nOccupation: Group 6\nGeographic region: North Cumbria\n\nI drove into the car-park of the modern veterinary practice. The receptionist warned me that M had been called out on an emergency so I waited 40 minutes or so. M greeted me warmly and took me to a meeting-sized room upstairs that was dominated by a large table with chairs and shelves of text-books and manuals. We moved into an adjacent small kitchen to make tea and then settled down with the tape switched on. \n\nCan you tell me a little bit about yourself and your background?\n\nBorn in Sheffield, brought up south of England, studied at Edinburgh University, got married, worked in N as a mixed practice for two years, and I’ve been working here in Cumbria since [then]. I’m married with four children.\n\nAnd why Cumbria?\n\nI wanted to do dairy practice. I spent quite a long time in the last six months when I was at N working ...",Group 6


That looks about right! Now let's use this with the len() function to get the counts we want!

In [36]:
print(len(oc_foot_mouth[oc_foot_mouth['Occupation'] == 'Group 1']))
print(" ")

print(len(oc_foot_mouth[oc_foot_mouth['Occupation'] == 'Group 2']))
print(" ")

print(len(oc_foot_mouth[oc_foot_mouth['Occupation'] == 'Group 3']))
print(" ")

print(len(oc_foot_mouth[oc_foot_mouth['Occupation'] == 'Group 4']))
print(" ")

print(len(oc_foot_mouth[oc_foot_mouth['Occupation'] == 'Group 5']))
print(" ")

print(len(oc_foot_mouth[oc_foot_mouth['Occupation'] == 'Group 6']))
print(" ")

13
 
13
 
17
 
16
 
19
 
9
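
Repeating `len()` once per group works, but pandas can produce all of the counts in one call with `value_counts()`. This is a sketch on a made-up `demo` series, not the real Occupation column.

```python
import pandas as pd

# Made-up Occupation values
demo = pd.Series(["Group 1", "Group 5", "Group 5", "Group 6"], name="Occupation")

# Count how often each value occurs, then order the result by group name
counts = demo.value_counts().sort_index()

print(counts)
```

On the real data the equivalent call would be made on `oc_foot_mouth['Occupation']`.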
 


Okay, so there is a bit of discrepancy in the sizes! Let's see if it has anything to do with a discrepancy in the respondents recorded for each type of data collection (diary vs focus group vs interview).

In [37]:
print(len(diary_file[diary_file['Occupation'] == 'Group 1']))
print(" ")

print(len(diary_file[diary_file['Occupation'] == 'Group 2']))
print(" ")

print(len(diary_file[diary_file['Occupation'] == 'Group 3']))
print(" ")

print(len(diary_file[diary_file['Occupation'] == 'Group 4']))
print(" ")

print(len(diary_file[diary_file['Occupation'] == 'Group 5']))
print(" ")

print(len(diary_file[diary_file['Occupation'] == 'Group 6']))
print(" ")

6
 
6
 
8
 
7
 
9
 
4
 


In [38]:
print(len(group_file[group_file['Occupation'] == 'Group 1']))
print(" ")

print(len(group_file[group_file['Occupation'] == 'Group 2']))
print(" ")

print(len(group_file[group_file['Occupation'] == 'Group 3']))
print(" ")

print(len(group_file[group_file['Occupation'] == 'Group 4']))
print(" ")

print(len(group_file[group_file['Occupation'] == 'Group 5']))
print(" ")

print(len(group_file[group_file['Occupation'] == 'Group 6']))
print(" ")

1
 
1
 
1
 
1
 
1
 
1
 


In [39]:
print(len(interview_file[interview_file['Occupation'] == 'Group 1']))
print(" ")

print(len(interview_file[interview_file['Occupation'] == 'Group 2']))
print(" ")

print(len(interview_file[interview_file['Occupation'] == 'Group 3']))
print(" ")

print(len(interview_file[interview_file['Occupation'] == 'Group 4']))
print(" ")

print(len(interview_file[interview_file['Occupation'] == 'Group 5']))
print(" ")

print(len(interview_file[interview_file['Occupation'] == 'Group 6']))
print(" ")

6
 
6
 
8
 
8
 
9
 
4
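
The three sets of counts above can also be produced as a single table. The sketch below assumes the file type can be read from the letters in the filename (`diary`, `fg`, `int`) and uses `pd.crosstab` to count occupations per file type. The column name `file_type` and the rows in `demo_df` are made up for illustration.

```python
import pandas as pd

# Made-up rows that follow the filename pattern seen in this notebook
demo_df = pd.DataFrame({
    "Filename": ["5407diary02.rtf", "5407diary09.rtf", "5407fg06.rtf", "5407int02.rtf", "5407int09.rtf"],
    "Occupation": ["Group 6", "Group 5", "Group 6", "Group 6", "Group 5"],
})

# Capture the letters between the leading digits and the file number
demo_df["file_type"] = demo_df["Filename"].str.extract(r'^\d+([a-z]+)\d+', expand=False)

# Rows are occupations, columns are file types, cells are counts
table = pd.crosstab(demo_df["Occupation"], demo_df["file_type"])

print(table)
```

Combinations that never occur are shown as 0, which makes gaps between formats easy to spot.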
 


Okay, so it looks like this discrepancy is the same across the formats! In which case, we will definitely need to calculate one polarity average (preferably a mean) for each of the occupations to account for it. I think that if I were to plot multiple averages for each occupation, the uneven number of files per occupation might skew things. 

Further resources

If you want to know more about different ways of filtering dataframes, please visit the link below:

* https://towardsdatascience.com/8-ways-to-filter-pandas-dataframes-d34ba585c1b8

## Pre-Processing Summary

In [2]:
# EVERYTHING YOU NEED TO RE-RUN IF THINGS GO WRONG

import os                         # os is a module for navigating your machine (e.g., file directories).
import pandas as pd

# List all of the files in the "data" folder that is provided to you
print("")
for file in os.listdir("./data/foot_mouth"):
   print("2. One of the files in ./data is...", file)
print("")

# Loading the .csv
foot_mouth_df = pd.read_csv('../code/data/foot_mouth/text.csv')

# Renaming Columns
foot_mouth_df.columns = ["Number", "Filename", "everything_else"]
foot_mouth_df.head()
print(" ")

#Creating New Dataframe with Occupation Column
oc_foot_mouth = foot_mouth_df.assign(Occupation = foot_mouth_df['everything_else'].str.extract(r'(\w+\s+\d{1,2})'))

oc_foot_mouth.head() # checking if it worked!

# Splitting DataFrames
diary_file = oc_foot_mouth.loc[:39]     # Saving variable for all diary rows
group_file = oc_foot_mouth.loc[40:45]   # Saving variables for all group rows
interview_file = oc_foot_mouth.loc[46:] # Saving variable for all interview rows

print("DataFrames Successfully split!")


2. One of the files in ./data is... foot_mouth_original.xls
2. One of the files in ./data is... text.csv

 
DataFrames Successfully split!


## Resources

If you want to know more about RegEx patterns (this one took me a while to figure out, so I clearly need to read up on them!) I suggest the following resources:

* https://www.dataquest.io/blog/regular-expressions-data-scientists/
* https://stackoverflow.com/questions/71499365/how-to-extract-date-in-month-d-yr-format-using-regex (for easier to copy/paste date formatting)

This is also a useful resource if you want to test your RegEx expressions:

* https://regex101.com - I used it to help figure out why my regex date pattern wasn't working