# Filtering data
While the `product_prices_renamed.csv` file was being generated, several processing errors occurred. Find them:
1. In the **date** column, data from 1888 appeared: '1888-0'.
2. Check the **date** column for similar errors, but with future values.
3. In the **value** column, an excessively high value was introduced. Find it and locate the rows that contain it (use `query`).
4. There is a spelling error in the **product_types** column for one of the products. Find it and the corresponding rows. How many such rows are there?

You do not need to assign the results to a variable; displaying them is enough.

> Based on the solution of this task, we will later correct all errors in the data.

Hints:

Subsection 2:
1. There is only one such value.
2. Use `loc` or `query` with the condition `date > '2020-1'`.

Subsection 3:
1. There is only one such value.
2. Do the following:
   a) use `describe()` to view the percentiles,
   b) use `loc` or `query` to find the erroneous entries.

Subsection 4:

You can do it in the following way:
a) use the `unique()` method to list all available values,
b) use `loc` or `query` to find the erroneous entries,
c) check the number of rows with the `shape` attribute.

In [11]:
import pandas as pd

In [12]:
df = pd.read_csv('../../01_Data/product_prices_renamed.csv',
  sep=';',
  encoding='UTF-8',
  decimal='.'
)

df.head()

Unnamed: 0,province,product_types,currency,product_group_id,product_line,value,date
0,SUBCARPATHIA,,PLN,2,pork ham cooked - per 1kg,21.37,2013-3
1,ŁÓDŹ,,PLN,4,bread - per 1kg,,2018-2
2,KUYAVIA-POMERANIA,,PLN,2,barley groats sausage - per 1kg,3.55,2019-12
3,LOWER SILESIA,,PLN,2,dressed chickens - per 1kg,6.14,2019-2
4,WARMIA-MASURIA,,PLN,2,Italian head cheese - per 1kg,5.63,2002-3
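The first item of the task (the '1888-0' entry) can be located with the same tools as the remaining items. Below is a minimal sketch on a small made-up frame; the `sample` rows are hypothetical and only mimic the format of the **date** column, so on the real data `sample` would be replaced with `df`.

```python
import pandas as pd

# Hypothetical rows that mimic the 'YYYY-M' string format of the date column.
sample = pd.DataFrame({
    "date": ["2013-3", "1888-0", "2018-2", "1888-0"],
    "value": [21.37, 0.0, 3.55, 1.2],
})

# Filter with a boolean mask passed to loc ...
by_loc = sample.loc[sample["date"] == "1888-0"]

# ... or express the same condition as a query string.
by_query = sample.query("date == '1888-0'")

print(by_loc)
```

Both forms select the same rows; which one to use is mostly a matter of readability.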


In [13]:
# Check for future dates; 'date' holds strings, so this is a lexicographic comparison against '2024-1'
df.loc[df['date'] > '2024-1']


Unnamed: 0,province,product_types,currency,product_group_id,product_line,value,date
1583,HOLY CROSS,,PLN,2,haddock fillets frozen - per 1kg,0.0,2099-13
35525,PODLASKIE,,PLN,2,haddock fillets frozen - per 1kg,0.0,2099-13
43258,SILESIA,,PLN,2,haddock fillets frozen - per 1kg,20.96,2099-13
52595,LUBUSZ,,PLN,2,haddock fillets frozen - per 1kg,0.0,2099-13
56032,GREATER POLAND,,PLN,2,haddock fillets frozen - per 1kg,18.92,2099-13
72048,OPOLE,,PLN,2,haddock fillets frozen - per 1kg,0.0,2099-13
73532,ŁÓDŹ,,PLN,2,haddock fillets frozen - per 1kg,0.0,2099-13
84515,LUBLIN,,PLN,2,haddock fillets frozen - per 1kg,16.15,2099-13
86619,POMERANIA,,PLN,2,haddock fillets frozen - per 1kg,16.22,2099-13
98839,WEST POMERANIA,,PLN,2,haddock fillets frozen - per 1kg,0.0,2099-13
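Because the **date** column holds strings, the `>` comparison works character by character. It catches '2099-13' only because '2099' sorts after '2024'; a value such as '2023-13' would pass unnoticed. The sketch below shows one possible stricter check that splits the string into a numeric year and month. The sample rows and the accepted year range (1990 to 2024) are assumptions made for illustration, not values taken from the file.

```python
import pandas as pd

# Hypothetical rows in the same 'YYYY-M' format as the date column.
sample = pd.DataFrame({"date": ["2013-3", "2099-13", "1888-0", "2019-12"]})

# Split 'YYYY-M' into two integer columns.
parts = sample["date"].str.split("-", expand=True).astype(int)
parts.columns = ["year", "month"]

# Assumed plausible range; adjust it to what the real data should contain.
valid_year = parts["year"].between(1990, 2024)
valid_month = parts["month"].between(1, 12)

# Rows that fail either check are worth inspecting.
suspicious = sample.loc[~(valid_year & valid_month)]
print(suspicious)
```

With this approach both the 1888 entry and the future entry show up in a single result.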


In [14]:
# Use describe() to view the percentiles and spot the excessively high value
df['value'].describe()

count    137088.000000
mean          6.615227
std          34.112858
min           0.000000
25%           0.000000
50%           3.090000
75%          10.920000
max        3000.000000
Name: value, dtype: float64
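The gap between the 75th percentile and the maximum already suggests an outlier. As a complementary check, `nlargest` lists the top few values directly. This is only a sketch: the numbers in `values` are made up for illustration.

```python
import pandas as pd

# Hypothetical values; on the real data this would be df['value'].
values = pd.Series([0.0, 3.09, 10.92, 3000.0, 21.37], name="value")

# Show the three largest entries together with their index labels.
top = values.nlargest(3)
print(top)
```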

In [15]:
# Find the rows with the excessively high value.
# The maximum (3000) comes from the describe() output above.

df.loc[df['value'] == 3000]

# Equivalent alternatives:
# df.loc[df['value'] == df['value'].max()]
# df.loc[df['value'] == df['value'].describe()['max']]
# df.query('value == {}'.format(df['value'].describe()['max']))


Unnamed: 0,province,product_types,currency,product_group_id,product_line,value,date
13724,POMERANIA,30% tomato concentrate - per 1kg,PLN,1,,3000.0,2003-1
18768,LOWER SILESIA,30% tomato concentrate - per 1kg,PLN,1,,3000.0,2003-1
22321,WEST POMERANIA,30% tomato concentrate - per 1kg,PLN,1,,3000.0,2003-1
31268,SUBCARPATHIA,30% tomato concentrate - per 1kg,PLN,1,,3000.0,2003-1
36958,LESSER POLAND,30% tomato concentrate - per 1kg,PLN,1,,3000.0,2003-1
62346,GREATER POLAND,30% tomato concentrate - per 1kg,PLN,1,,3000.0,2003-1
70851,MASOVIA,30% tomato concentrate - per 1kg,PLN,1,,3000.0,2003-1
79924,KUYAVIA-POMERANIA,30% tomato concentrate - per 1kg,PLN,1,,3000.0,2003-1
91306,ŁÓDŹ,30% tomato concentrate - per 1kg,PLN,1,,3000.0,2003-1
96597,OPOLE,30% tomato concentrate - per 1kg,PLN,1,,3000.0,2003-1
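Since the task asks for `query`, it may help to note that a query string can reference a Python variable with the `@` prefix, which avoids string formatting. A minimal sketch on hypothetical rows (the `sample` frame and the `max_value` name are introduced here for illustration only):

```python
import pandas as pd

# Hypothetical rows; on the real data this would be df.
sample = pd.DataFrame({
    "province": ["POMERANIA", "OPOLE", "MASOVIA"],
    "value": [3000.0, 4.25, 3000.0],
})

# Compute the maximum once and reference it inside the query with '@'.
max_value = sample["value"].max()
rows_at_max = sample.query("value == @max_value")
print(rows_at_max)
```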


In [16]:
# Examine the unique values in the 'product_types' column to identify the spelling error
df['product_types'].unique()

array([nan, 'whole pickled cucumbers 0.9l - per 1pc.',
       'fresh chichen egges - per 666pcs.',
       '30% tomato concentrate - per 1kg',
       'frozen carrot and pea mix - per 1kg',
       'beet sugar white, bagged - per 1kg',
       'apple juice, boxed - per 1l', 'white table salt bagged - per 1kg',
       'natural chocolate plain - per 1kg'], dtype=object)

In [17]:
# Select the rows with the spelling error in the 'product_types' column
# The suspected error is in 'fresh chichen egges - per 666pcs.' (should be 'chicken eggs')

df.loc[df['product_types'] == 'fresh chichen egges - per 666pcs.']

Unnamed: 0,province,product_types,currency,product_group_id,product_line,value,date
6,WEST POMERANIA,fresh chichen egges - per 666pcs.,PLN,3,,0.00,2004-12
13,POLAND,fresh chichen egges - per 666pcs.,PLN,3,,0.27,2009-8
38,WARMIA-MASURIA,fresh chichen egges - per 666pcs.,PLN,3,,0.00,2003-12
202,GREATER POLAND,fresh chichen egges - per 666pcs.,PLN,3,,0.32,2018-8
229,WARMIA-MASURIA,fresh chichen egges - per 666pcs.,PLN,3,,0.00,2013-1
...,...,...,...,...,...,...,...
149691,POMERANIA,fresh chichen egges - per 666pcs.,PLN,3,,0.00,2019-3
149718,LUBUSZ,fresh chichen egges - per 666pcs.,PLN,3,,0.00,2009-1
149724,OPOLE,fresh chichen egges - per 666pcs.,PLN,3,,0.25,2011-12
149767,WARMIA-MASURIA,fresh chichen egges - per 666pcs.,PLN,3,,0.00,2001-3


In [18]:
df.loc[df['product_types'] == 'fresh chichen egges - per 666pcs.'].shape[0]

4284
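An alternative to filtering and reading `shape[0]` is `value_counts`, which returns the number of rows for every distinct value at once. The sketch below uses a few hypothetical rows; on the real data the series would be `df['product_types']`.

```python
import pandas as pd

# Hypothetical rows; None stands in for the missing values seen in the column.
sample = pd.DataFrame({
    "product_types": [
        "fresh chichen egges - per 666pcs.",
        None,
        "30% tomato concentrate - per 1kg",
        "fresh chichen egges - per 666pcs.",
    ]
})

# Count rows per distinct value; dropna=False keeps the missing values visible.
counts = sample["product_types"].value_counts(dropna=False)

# Look up the count for the misspelled label.
typo_rows = counts["fresh chichen egges - per 666pcs."]
print(counts)
```

Seeing all labels with their counts side by side can also make a misspelled value easier to notice.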