# Filtering Documents with Pandas

This notebook demonstrates how to process and filter text documents using Pandas.

In [1]:
import pandas

In [None]:
# ...

In [2]:
# for speed we will use our small 5k corpus
url = "https://github.com/ValRCS/BSSDH_22/raw/main/corpora/lv_old_newspapers_5k.tsv"
# if this were a zip file we would add the compression='zip' parameter
df = pandas.read_csv(url, sep="\t")
df.shape

(4999, 4)
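The `sep` and `compression` parameters mentioned in the cell above can be tried without a network connection. The following is a minimal sketch that assumes a made-up two-row table (not the real corpus) written to a temporary folder; the file names `mini.tsv` and `mini.zip` are hypothetical.

```python
import os
import tempfile
import zipfile

import pandas

# Made-up two-row table standing in for the real corpus file
tsv_text = (
    "Language\tSource\tDate\tText\n"
    "Latvian\tdiena.lv\t2012/01/10\tfirst text\n"
    "Latvian\tzz.lv\t2011/10/05\tsecond text\n"
)

folder = tempfile.mkdtemp()
tsv_path = os.path.join(folder, "mini.tsv")
zip_path = os.path.join(folder, "mini.zip")

with open(tsv_path, mode="w", encoding="utf-8") as f:
    f.write(tsv_text)
with zipfile.ZipFile(zip_path, mode="w") as archive:
    archive.write(tsv_path, arcname="mini.tsv")  # the archive holds exactly one file

plain_df = pandas.read_csv(tsv_path, sep="\t")
zipped_df = pandas.read_csv(zip_path, sep="\t", compression="zip")
print(plain_df.shape, zipped_df.shape)
```

Both calls should give the same table. When the path ends in `.zip`, pandas can usually infer the compression on its own, so spelling it out mainly documents the intent.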

In [3]:
df.head()

Unnamed: 0,Language,Source,Date,Text
0,Latvian,rekurzeme.lv,2008/09/04,"""Viņa pirmsnāves zīmītē bija rakstīts vienīgi ..."
1,Latvian,diena.lv,2012/01/10,info@zurnalistiem.lv
2,Latvian,bauskasdzive.lv,2007/12/27,"Bhuto, kas Pakistānā no trimdas atgriezās tika..."
3,Latvian,bauskasdzive.lv,2008/10/08,Plkst. 4.00 Samoilovs / Pļaviņš (pludmales vol...
4,Latvian,diena.lv,2011/10/05,"CVK bija vērsusies Skaburska, lūdzot izskaidro..."


In [5]:
# let's check which languages the articles use
# df.Language.unique() also works; Language is the name of the column
# if a column name contains spaces you need the full bracket syntax
df["Language"].unique()  # same as above, but also works for names with spaces

array(['Latvian'], dtype=object)
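The difference between the two column-access styles shows up as soon as a column name contains a space. A small sketch on a made-up frame (the `Word Count` column is hypothetical and not part of the corpus):

```python
import pandas

# Hypothetical frame with one column name that contains a space
toy = pandas.DataFrame({"Language": ["Latvian", "Estonian"], "Word Count": [120, 95]})

same = toy.Language.equals(toy["Language"])  # both forms return the same Series
word_counts = toy["Word Count"]              # only bracket syntax works for this name
print(same, word_counts.tolist())
```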

In [None]:
# Looks like we have a single language
# How about Sources?

In [6]:
df.Source.unique()

array(['rekurzeme.lv', 'diena.lv', 'bauskasdzive.lv', 'zz.lv', 'ntz.lv',
       'rv.lv', 'la.lv', 'nra.lv', 'ziemellatvija.lv', 'db.lv',
       'bdaugava.lv', 'dzirkstele.lv', 'staburags.lv'], dtype=object)

In [7]:
# How to get only articles from a single Source?
# note: Diena is a Latvian newspaper ("Day" in English)
# the filter creates a new DataFrame that holds a subset of the original rows
diena_df = df[df.Source == 'diena.lv']
diena_df.shape

(634, 4)
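It can help to look at the filter in two steps: the comparison produces a Series of True/False values, and that Series then selects the rows. A minimal sketch on a made-up three-row frame:

```python
import pandas

toy = pandas.DataFrame({"Source": ["diena.lv", "zz.lv", "diena.lv"], "Text": ["a", "b", "c"]})

mask = toy.Source == "diena.lv"  # Series of True/False, one value per row
subset = toy[mask]               # keeps only the rows where the mask is True
print(mask.tolist())
print(subset.index.tolist())     # the original row labels are preserved
```

Because the row labels are kept, a filtered DataFrame usually has gaps in its index.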

In [8]:
# let's check how many different sources we have in this new dataframe
diena_df.Source.unique()

array(['diena.lv'], dtype=object)

In [9]:
diena_df.head()

Unnamed: 0,Language,Source,Date,Text
1,Latvian,diena.lv,2012/01/10,info@zurnalistiem.lv
4,Latvian,diena.lv,2011/10/05,"CVK bija vērsusies Skaburska, lūdzot izskaidro..."
8,Latvian,diena.lv,2011/11/26,"PĒDĒJĀ, kontrolēja PĀRDAUGAVAS telpu, izņemot ..."
20,Latvian,diena.lv,2010/02/12,Grezna kompānija
23,Latvian,diena.lv,2011/10/25,18. decembrī muzejā tiks svinēts Bluķa vakars ...


In [10]:
document_list = list(diena_df.Text)
len(document_list)

634

In [11]:
document_list[:3]

['info@zurnalistiem.lv',
 'CVK bija vērsusies Skaburska, lūdzot izskaidrot, kāpēc viņa palikusi aiz svītras, ja sākotnēji tika ziņots, ka viņa ir ievēlēta.',
 'PĒDĒJĀ, kontrolēja PĀRDAUGAVAS telpu, izņemot LATVIEŠIEM pēkšņi atgūstot Bolderājas telpu, kur latviešiem ar savu karakuģu iznīcinošu uguni izlīdzēja briti un franči un veiksmīgu pašu organizēto DESANTU izdevās atkārtoti ieņemt telpu un Daugavgrīvas cietoksni, saņemot gūstā pustūkstoti krievu izcelsmes bermontiešu, kas LATVIJAS armijai palīdzēja radīt placdarmu nākotnes notikumiem, nedodot iespēju „kņazam”, Daugavai aizsalstot, to pāriet un sākt LATVIJAI liktenīgu UZBRUKUMU RĪGAI no ziemeļiem... Šķita dižais „kņazs” gaidīja Daugavas pārklāšanos ar ledu, kas']

In [13]:
with open("diena_docs.txt", mode="w", encoding="utf-8") as f:
    f.write("\n".join(document_list)) # one string, each document separated by a newline
    # f.writelines(document_list) # this would write without \n so ONE BIG STRING
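A quick way to check the one-document-per-line layout is to write a few texts and read them back. This sketch uses made-up sample texts and a temporary file, and it assumes that no document contains a newline of its own:

```python
import os
import tempfile

docs = ["first document", "second document", "third document"]  # made-up sample texts

path = os.path.join(tempfile.mkdtemp(), "docs.txt")
with open(path, mode="w", encoding="utf-8") as f:
    f.write("\n".join(docs))  # one document per line

with open(path, encoding="utf-8") as f:
    restored = f.read().split("\n")

print(restored == docs)
```

If a document did contain a newline, it would be split into two entries on reading, so such texts would need cleaning first.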

In [14]:
# How about Date?
# First let's check our data types
diena_df.info() # so object means string here

<class 'pandas.core.frame.DataFrame'>
Int64Index: 634 entries, 1 to 4996
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Language  634 non-null    object
 1   Source    634 non-null    object
 2   Date      634 non-null    object
 3   Text      634 non-null    object
dtypes: object(4)
memory usage: 24.8+ KB


In [None]:
# Looks like all our columns are objects (so strings)
# Pandas also has a special datetime data type
# More on this here: https://pandas.pydata.org/docs/reference/arrays.html?highlight=datetime#datetime-data

# In our case we will use regular string methods to filter our years
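For comparison, here is a minimal sketch of the datetime route on a few made-up date strings in the same YYYY/MM/DD layout as the Date column. It is an alternative to the string approach used in this notebook, not a replacement for it.

```python
import pandas

dates = pandas.Series(["2012/01/10", "2011/10/05", "2010/02/12"])  # same layout as the Date column

parsed = pandas.to_datetime(dates, format="%Y/%m/%d")  # strings become real dates
years = parsed.dt.year                                  # integer year of each date
print(years.tolist())
```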

In [None]:
# Let's make a new column which will contain year

In [15]:
diena_df["CurrentYear"] = 2022 # this would create a new column with 2022 for ALL rows
diena_df.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,Language,Source,Date,Text,CurrentYear
1,Latvian,diena.lv,2012/01/10,info@zurnalistiem.lv,2022
4,Latvian,diena.lv,2011/10/05,"CVK bija vērsusies Skaburska, lūdzot izskaidro...",2022
8,Latvian,diena.lv,2011/11/26,"PĒDĒJĀ, kontrolēja PĀRDAUGAVAS telpu, izņemot ...",2022
20,Latvian,diena.lv,2010/02/12,Grezna kompānija,2022
23,Latvian,diena.lv,2011/10/25,18. decembrī muzejā tiks svinēts Bluķa vakars ...,2022


In [None]:
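# Notice the warning: diena_df was created by filtering the original BIG dataframe,
# so pandas cannot be sure whether we are changing a copy or the original
# To avoid this warning we could have made a copy of our dataframe first: diena_copy = diena_df.copy()
# here the warning is harmless, the new column is still added to diena_df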
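The copy-first approach can be sketched on a small made-up frame. Calling `.copy()` right after filtering gives an independent DataFrame, so adding a column to it should not trigger the warning and leaves the original untouched.

```python
import pandas

toy = pandas.DataFrame({
    "Source": ["diena.lv", "zz.lv", "diena.lv"],
    "Date": ["2012/01/10", "2011/10/05", "2010/02/12"],
})

subset = toy[toy.Source == "diena.lv"].copy()  # explicit, independent copy
subset["Year"] = subset.Date.str[:4]            # new column only on the copy

print(list(toy.columns))
print(list(subset.columns))
```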


In [16]:
# okay, we made a new column with the current year, but that is not useful here
# we want the year of publication
# to get the year we split the Date column on "/" and keep the first part
diena_df["Year"] = diena_df.Date.str.split("/").str[0]
diena_df.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,Language,Source,Date,Text,CurrentYear,Year
1,Latvian,diena.lv,2012/01/10,info@zurnalistiem.lv,2022,2012
4,Latvian,diena.lv,2011/10/05,"CVK bija vērsusies Skaburska, lūdzot izskaidro...",2022,2011
8,Latvian,diena.lv,2011/11/26,"PĒDĒJĀ, kontrolēja PĀRDAUGAVAS telpu, izņemot ...",2022,2011
20,Latvian,diena.lv,2010/02/12,Grezna kompānija,2022,2010
23,Latvian,diena.lv,2011/10/25,18. decembrī muzejā tiks svinēts Bluķa vakars ...,2022,2011


In [17]:
# let's check how many years our mini corpus covers
diena_df.Year.unique()

array(['2012', '2011', '2010', '2009', '2008', '2007', '2006'],
      dtype=object)

In [18]:
# let's check how many articles we have for each year
diena_df.Year.value_counts()

2011    442
2012    106
2010     41
2009     26
2008     15
2007      3
2006      1
Name: Year, dtype: int64
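`value_counts()` orders the result by frequency. If a chronological listing is easier to read, the counts can be sorted by their index instead. A small sketch on made-up year strings:

```python
import pandas

years = pandas.Series(["2011", "2012", "2011", "2010", "2011"])  # made-up values

by_count = years.value_counts()               # most frequent year first
by_year = years.value_counts().sort_index()   # chronological order instead
print(by_year.to_dict())
```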

In [None]:
# Let's get year 2011

In [19]:
diena_2011 = diena_df[diena_df.Year == "2011"]  # Notice we use quotes for year since it is still a string
diena_2011.head()

Unnamed: 0,Language,Source,Date,Text,CurrentYear,Year
4,Latvian,diena.lv,2011/10/05,"CVK bija vērsusies Skaburska, lūdzot izskaidro...",2022,2011
8,Latvian,diena.lv,2011/11/26,"PĒDĒJĀ, kontrolēja PĀRDAUGAVAS telpu, izņemot ...",2022,2011
23,Latvian,diena.lv,2011/10/25,18. decembrī muzejā tiks svinēts Bluķa vakars ...,2022,2011
25,Latvian,diena.lv,2011/12/01,Ceturtdien iekšlietu ministrs Rihards Kozlovsk...,2022,2011
28,Latvian,diena.lv,2011/12/28,(Rakstā ir 2235 simboli...),2022,2011


In [20]:
diena_2011.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 442 entries, 4 to 4990
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Language     442 non-null    object
 1   Source       442 non-null    object
 2   Date         442 non-null    object
 3   Text         442 non-null    object
 4   CurrentYear  442 non-null    int64 
 5   Year         442 non-null    object
dtypes: int64(1), object(5)
memory usage: 24.2+ KB


In [21]:
# Notice how CurrentYear is int64, a 64-bit integer that can hold huge values
# we can convert a string to an integer if we want to do numeric operations
# this is useful if we want to filter a range of years
# because string comparison with > and < works differently from numeric comparison

diena_df["Year_Numeric"] = diena_df.Year.astype(int)
# here we are creating a new column, but we could have overwritten an existing column as well
# for example:
# diena_df["Year"] = diena_df.Year.astype(int) # would overwrite the old column: same data, different data type
diena_df.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


Unnamed: 0,Language,Source,Date,Text,CurrentYear,Year,Year_Numeric
1,Latvian,diena.lv,2012/01/10,info@zurnalistiem.lv,2022,2012,2012
4,Latvian,diena.lv,2011/10/05,"CVK bija vērsusies Skaburska, lūdzot izskaidro...",2022,2011,2011
8,Latvian,diena.lv,2011/11/26,"PĒDĒJĀ, kontrolēja PĀRDAUGAVAS telpu, izņemot ...",2022,2011,2011
20,Latvian,diena.lv,2010/02/12,Grezna kompānija,2022,2010,2010
23,Latvian,diena.lv,2011/10/25,18. decembrī muzejā tiks svinēts Bluķa vakars ...,2022,2011,2011


In [22]:
# why integers from strings
"9" > "20" # this might not be what you want

True

In [23]:
# we want this
9 > 20

False
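The same difference shows up when sorting a whole column. A minimal sketch with made-up values of different lengths:

```python
import pandas

as_text = pandas.Series(["9", "20", "100"])
as_numbers = as_text.astype(int)

text_order = as_text.sort_values().tolist()        # compared character by character
numeric_order = as_numbers.sort_values().tolist()  # compared by value
print(text_order)
print(numeric_order)
```

For four-digit years the two orders happen to agree, because strings of equal length made only of digits sort the same way as the numbers they represent. Converting to integers removes the need to rely on that.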

In [24]:
diena_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 634 entries, 1 to 4996
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Language      634 non-null    object
 1   Source        634 non-null    object
 2   Date          634 non-null    object
 3   Text          634 non-null    object
 4   CurrentYear   634 non-null    int64 
 5   Year          634 non-null    object
 6   Year_Numeric  634 non-null    int64 
dtypes: int64(2), object(5)
memory usage: 39.6+ KB


In [25]:
diena_df.columns

Index(['Language', 'Source', 'Date', 'Text', 'CurrentYear', 'Year',
       'Year_Numeric'],
      dtype='object')

In [26]:
# let's get years 2009 to 2011
# combine both conditions into a single mask with & (each condition needs its own parentheses)
year_mask = (diena_df.Year_Numeric >= 2009) & (diena_df.Year_Numeric <= 2011)
diena_2009_to_2011 = diena_df[year_mask]
diena_2009_to_2011.Year_Numeric.unique()
# chaining two filters, df[cond1][cond2], also works but makes pandas reindex the second mask and warn
# More on multiple conditions: https://kanoki.org/2020/01/21/pandas-dataframe-filter-with-multiple-conditions/

  


array([2011, 2010, 2009])
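Two further ways to express a range or a set of years are `Series.between` and `Series.isin`. This is a sketch on a made-up frame that reuses the `Year_Numeric` column name from above:

```python
import pandas

toy = pandas.DataFrame({"Year_Numeric": [2008, 2009, 2010, 2011, 2012]})

# between is inclusive on both ends by default
in_range = toy[toy.Year_Numeric.between(2009, 2011)]

# isin picks arbitrary values, which do not have to be consecutive
picked = toy[toy.Year_Numeric.isin([2008, 2012])]

print(in_range.Year_Numeric.tolist())
print(picked.Year_Numeric.tolist())
```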

In [27]:
# How about using some text filters
# Let's look for all documents mentioning Rīga
d_Riga = diena_df[diena_df.Text.str.contains("Rīga")]
d_Riga.shape

(37, 7)

In [31]:
d_Riga.sample(5) # like head or tail, except it gives us a random sample of rows

Unnamed: 0,Language,Source,Date,Text,CurrentYear,Year,Year_Numeric
4874,Latvian,diena.lv,2011/12/19,"Viņa minēja, ka visuzskatāmākais piemērs ir re...",2022,2011,2011
723,Latvian,diena.lv,2012/01/13,Klifs Grants ir Rīgas Zooloģiskajā dārzā mītoš...,2022,2012,2012
3795,Latvian,diena.lv,2012/01/09,"Kā pastāstīja Bite, saistībā ar gaidāmo ģimene...",2022,2012,2012
322,Latvian,diena.lv,2011/09/01,Kā aģentūru LETA informēja policijas Rīgas reģ...,2022,2011,2011
2343,Latvian,diena.lv,2011/10/31,"Svētdien Kārsavas muitas kontroles punktā, vei...",2022,2011,2011


In [32]:
first_doc_text = list(d_Riga.Text)[0] # text of first article in our sub corpus
first_doc_text

'Latvijas skatītāji Cirque du Soleil mākslu iepazīs šovā Saltimbanco, kas decembrī tiks demonstrēts Arēnā Rīga. Šīs programmas pirmizrāde notika 1992. gada 23. aprīlī Monreālā, 2007. gadā iestudējums tika atsvaidzināts. Ik gadu Saltimbanco viesojas 40 pilsētās. Iestudējums, kurā piedalās 50 mākslinieku, stāsta par transformācijām, kuras piedzīvo cilvēks, pārceļoties no provinces lielpilsētā. Ja gribat uzzināt par akrobātu un vingrotāju trikiem, kas tiek izpildīti šajā šovā, ieejiet Saltimbanco sadaļā Cirque du Soleil mājaslapā. Nemēģiniet tos atkārtot mājās!'

In [33]:
# We can use regular expression syntax for more powerful search capabilities
# https://regex101.com/ is just one of the places to practice your regex skills
d_Riga_2 = diena_df[diena_df.Text.str.contains("Rīg[aāu]")] # matches any text containing Rīga, Rīgā or Rīgu
d_Riga_2.shape

(40, 7)
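The pattern is matched anywhere inside the text, so longer word forms that start with one of the three variants are found as well. A small sketch on made-up strings, including one missing value to show the `na` parameter:

```python
import pandas

texts = pandas.Series(["Arēnā Rīga", "dzīvo Rīgā", "brauc uz Rīgu", "Rīgas dome", "Tallinā", None])

# The character class matches Rīga, Rīgā or Rīgu anywhere in the text;
# na=False treats missing text as "no match" instead of returning a missing value
mask = texts.str.contains("Rīg[aāu]", regex=True, na=False)
print(mask.tolist())
```

"Rīgas dome" is matched too, because it contains "Rīga" as a substring.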

In [34]:
# We can exclude search terms
d_Riga_3 = d_Riga_2[~d_Riga_2.Text.str.contains("sport")] # exclude all rows that contain "sport"; notice the ~ which negates the filter
d_Riga_3.shape

(37, 7)
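Including and excluding can also be combined in a single step by joining the two masks with `&`. A minimal sketch on made-up sentences; `case=False` is an optional extra that also catches capitalized forms, which the case-sensitive filter above would let through:

```python
import pandas

texts = pandas.Series([
    "Rīgas maratons un sports",
    "Rīgas dome lemj",
    "Sporta ziņas",
    "Tallinas ziņas",
])

has_riga = texts.str.contains("Rīg")
has_sport = texts.str.contains("sport", case=False)  # case=False also catches "Sporta"

kept = texts[has_riga & ~has_sport]  # mentions Rīga and does not mention sport
print(kept.tolist())
```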

In [35]:
# How about combining our dataframes, assuming they have the same columns?
df_combined = pandas.concat([d_Riga, d_Riga_2, d_Riga_3], ignore_index=True) # the old index is not meaningful here, so we drop it
df_combined.shape

(114, 7)

In [36]:
df_combined.head()

Unnamed: 0,Language,Source,Date,Text,CurrentYear,Year,Year_Numeric
0,Latvian,diena.lv,2011/12/13,Latvijas skatītāji Cirque du Soleil mākslu iep...,2022,2011,2011
1,Latvian,diena.lv,2011/09/01,Kā aģentūru LETA informēja policijas Rīgas reģ...,2022,2011,2011
2,Latvian,diena.lv,2010/07/30,To Dienai apliecināja divi šīs ballītes viesi ...,2022,2010,2010
3,Latvian,diena.lv,2012/01/13,Klifs Grants ir Rīgas Zooloģiskajā dārzā mītoš...,2022,2012,2012
4,Latvian,diena.lv,2011/09/28,"Taču vides draugi apgalvo, ka pie pašreizējiem...",2022,2011,2011


In [37]:
# we have quite a few duplicate rows here
# Let's drop them: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
df_unique_rows = df_combined.drop_duplicates()
df_unique_rows.shape

(40, 7)
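The combine-then-deduplicate step can be followed on two tiny made-up frames that share one row:

```python
import pandas

a = pandas.DataFrame({"Date": ["2011/01/01", "2010/05/05"], "Text": ["x", "y"]})
b = pandas.DataFrame({"Date": ["2010/05/05", "2012/03/03"], "Text": ["y", "z"]})

combined = pandas.concat([a, b], ignore_index=True)  # 4 rows with a fresh 0..3 index
unique_rows = combined.drop_duplicates()              # keeps the first copy of each row
print(combined.shape, unique_rows.shape)
print(unique_rows.index.tolist())
```

The surviving rows keep their labels from the combined frame, which is why the index has gaps afterwards. Calling `reset_index(drop=True)` would renumber them if that matters.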

In [38]:
# let's sort them by Date (YYYY/MM/DD strings sort in chronological order)
df_sorted = df_unique_rows.sort_values(by="Date", ascending=True)
df_sorted

Unnamed: 0,Language,Source,Date,Text,CurrentYear,Year,Year_Numeric
31,Latvian,diena.lv,2009/10/22,"Ikgadējais čempionāts ""Latvijas lielākais ķirb...",2022,2009,2009
34,Latvian,diena.lv,2010/01/15,"Gada vārda, nevārda un spārnotā teiciena meklē...",2022,2010,2010
15,Latvian,diena.lv,2010/03/31,Līdz LBL pamatturnīra beigām atlicis aizvadīt ...,2022,2010,2010
18,Latvian,diena.lv,2010/05/21,Šo svētdien tas parādīsies Nordea Rīgas marato...,2022,2010,2010
2,Latvian,diena.lv,2010/07/30,To Dienai apliecināja divi šīs ballītes viesi ...,2022,2010,2010
9,Latvian,diena.lv,2010/08/31,"RTAB, kura izveidi savulaik iniciējis galvaspi...",2022,2010,2010
23,Latvian,diena.lv,2011/02/28,Saktastirgotāja Inta Koļesņikova A pauda bažas...,2022,2011,2011
11,Latvian,diena.lv,2011/05/13,"Tieto darbinieki ir aizrautīgi un aktīvi, ko a...",2022,2011,2011
19,Latvian,diena.lv,2011/05/26,Atbalstu fonda idejai apliecinājuši Romas kato...,2022,2011,2011
28,Latvian,diena.lv,2011/06/22,"Lielākā daļa ""Maxima X"" veikalu 23.jūnijā būs ...",2022,2011,2011


In [39]:
# now we can save our mini corpus
df_sorted.to_csv("diena_Rīga.csv", sep="\t") # we continue using tab for separation
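By default `to_csv` also writes the row labels as an unnamed first column. If they are not needed, `index=False` leaves them out. The sketch below uses a made-up frame and an in-memory buffer instead of a real file to show the round trip:

```python
import io

import pandas

toy = pandas.DataFrame({"Date": ["2010/05/05", "2011/01/01"], "Text": ["y", "x"]}, index=[7, 3])

buffer = io.StringIO()
toy.to_csv(buffer, sep="\t", index=False)  # index=False leaves the row labels out
buffer.seek(0)

restored = pandas.read_csv(buffer, sep="\t")
print(list(restored.columns))
```

Whether to keep the index is a judgment call: here the labels point back to rows of the original corpus, so keeping them can be useful too.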


In [None]:
# if we are running Jupyter notebooks locally, we already have the file on our computer
# if we are using Google Colab, we need to download the file


# Your Task

## Open the big corpus (.zip file) for your language (EE, LV, UA) as a starting point
## Filter your own corpus of at least 100 rows

* As a minimum, you should use some keyword matching in your filters.
* Save the corpus as a .csv file using tab as the separator.
* Also download the notebook (if you are on Colab).
* Submit both the notebook (.ipynb file) and the corpus (.csv) in order to receive ECTS credit points from the University of Latvia.
* The .ipynb can keep its name: Filtering Documents with Pandas.ipynb
* The name of the .csv should reflect your corpus, for example: Spring_in_Tallinn.csv

[Submission Form](https://forms.gle/kaN7CGrtcP2UWEFw8)

PS: If you do not use Google services (that is, you completed the assignment without Google Colab and do not have a Gmail account), you can e-mail the files directly to valdis.s.coding at gmail com

In [40]:
diena_df.shape

(634, 7)

In [41]:
diena_100 = diena_df.sample(100) # could use head(100) or tail(100)
# for full credit do not just use this single command; I want to see some filtering beforehand
diena_100.shape

(100, 7)
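`sample` picks different rows on every run. If the same selection is needed again later, for example to regenerate a submitted corpus, a fixed seed can be passed through `random_state`. A minimal sketch on a made-up frame; the seed value 42 is arbitrary:

```python
import pandas

toy = pandas.DataFrame({"n": range(50)})

first = toy.sample(5, random_state=42)   # the seed value 42 is arbitrary
second = toy.sample(5, random_state=42)  # same seed, same rows
print(first.index.tolist() == second.index.tolist())
```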

In [43]:
diena_100.to_csv("diena_100.csv", sep="\t")

In [None]:
# Your work starts here
# you submit this same notebook just with your code below
