# Cleaning Data with Pandas

Much of the data in real-world projects is messy: values can be wrong, out of place, or missing. We use pandas to clean the data and get the dataset ready for the next step in the data pipeline.

### Understanding the data

1. Understand the data types used for each column.
2. Aggregate the data: min, max, mean, standard deviation. This helps us spot anomalies.
3. Normalize the data: center the data's distribution around zero, or rescale the data into the range 0 to 1. This is an important step in data processing for machine learning.
4. Transform the data, an entire column at a time.
5. Filter the data to show only certain rows and columns.
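As a quick illustration of the first two steps, the sketch below inspects the column types and summary statistics of a tiny, made-up DataFrame (the `works` frame and its values are hypothetical, not taken from the dataset used later). An out-of-range value tends to stand out in the `max` and `std` entries.

```python
import pandas as pd

# Hypothetical toy data: one height is clearly out of range.
works = pd.DataFrame({
    "artist": ["Blake, Robert", "Blake, Robert", "Blake, William"],
    "height": [419.0, 213.0, 99999.0],
})

# Step 1: which columns are numeric and which hold strings?
print(works.dtypes)

# Step 2: count, mean, std, min, quartiles and max in one call.
summary = works["height"].describe()
print(summary)
```

`describe()` is only one convenient way to get these aggregates; the individual `min()`, `max()`, `mean()` and `std()` calls give the same numbers.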

# Viewing and Converting types



In [0]:
import pandas as pd

In [0]:
data = pd.read_csv("https://github.com/tategallery/collection/raw/master/artwork_data.csv")
data



Unnamed: 0,id,accession_number,artist,artistRole,artistId,title,dateText,medium,creditLine,year,acquisitionYear,dimensions,width,height,depth,units,inscription,thumbnailCopyright,thumbnailUrl,url
0,1035,A00001,"Blake, Robert",artist,38,A Figure Bowing before a Seated Old Man with h...,date not known,"Watercolour, ink, chalk and graphite on paper....",Presented by Mrs John Richmond 1922,,1922.0,support: 394 x 419 mm,394,419,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-a-fi...
1,1036,A00002,"Blake, Robert",artist,38,"Two Drawings of Frightened Figures, Probably f...",date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 311 x 213 mm,311,213,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-two-...
2,1037,A00003,"Blake, Robert",artist,38,The Preaching of Warning. Verso: An Old Man En...,?c.1785,Graphite on paper. Verso: graphite on paper,Presented by Mrs John Richmond 1922,1785,1922.0,support: 343 x 467 mm,343,467,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
3,1038,A00004,"Blake, Robert",artist,38,Six Drawings of Figures with Outstretched Arms,date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 318 x 394 mm,318,394,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-six-...
4,1039,A00005,"Blake, William",artist,39,The Circle of the Lustful: Francesca da Rimini...,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826,1919.0,image: 243 x 335 mm,243,335,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
5,1040,A00006,"Blake, William",artist,39,Ciampolo the Barrator Tormented by the Devils,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826,1919.0,image: 240 x 338 mm,240,338,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-ciam...
6,1041,A00007,"Blake, William",artist,39,The Baffled Devils Fighting,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826,1919.0,image: 242 x 334 mm,242,334,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
7,1042,A00008,"Blake, William",artist,39,The Six-Footed Serpent Attacking Agnolo Brunel...,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826,1919.0,image: 246 x 340 mm,246,340,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
8,1043,A00009,"Blake, William",artist,39,The Serpent Attacking Buoso Donati,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826,1919.0,image: 241 x 335 mm,241,335,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
9,1044,A00010,"Blake, William",artist,39,The Pit of Disease: The Falsifiers,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826,1919.0,image: 243 x 340 mm,243,340,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...


In [0]:
# for simplicity, I'm selecting only a few rows in this dataset.

data = data[:10]
data

Unnamed: 0,id,accession_number,artist,artistRole,artistId,title,dateText,medium,creditLine,year,acquisitionYear,dimensions,width,height,depth,units,inscription,thumbnailCopyright,thumbnailUrl,url
0,1035,A00001,"Blake, Robert",artist,38,A Figure Bowing before a Seated Old Man with h...,date not known,"Watercolour, ink, chalk and graphite on paper....",Presented by Mrs John Richmond 1922,,1922.0,support: 394 x 419 mm,394,419,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-a-fi...
1,1036,A00002,"Blake, Robert",artist,38,"Two Drawings of Frightened Figures, Probably f...",date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 311 x 213 mm,311,213,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-two-...
2,1037,A00003,"Blake, Robert",artist,38,The Preaching of Warning. Verso: An Old Man En...,?c.1785,Graphite on paper. Verso: graphite on paper,Presented by Mrs John Richmond 1922,1785.0,1922.0,support: 343 x 467 mm,343,467,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
3,1038,A00004,"Blake, Robert",artist,38,Six Drawings of Figures with Outstretched Arms,date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 318 x 394 mm,318,394,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-six-...
4,1039,A00005,"Blake, William",artist,39,The Circle of the Lustful: Francesca da Rimini...,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 243 x 335 mm,243,335,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
5,1040,A00006,"Blake, William",artist,39,Ciampolo the Barrator Tormented by the Devils,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 240 x 338 mm,240,338,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-ciam...
6,1041,A00007,"Blake, William",artist,39,The Baffled Devils Fighting,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 242 x 334 mm,242,334,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
7,1042,A00008,"Blake, William",artist,39,The Six-Footed Serpent Attacking Agnolo Brunel...,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 246 x 340 mm,246,340,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
8,1043,A00009,"Blake, William",artist,39,The Serpent Attacking Buoso Donati,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 241 x 335 mm,241,335,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
9,1044,A00010,"Blake, William",artist,39,The Pit of Disease: The Falsifiers,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 243 x 340 mm,243,340,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...


In [0]:
data.dtypes

# The object dtype here usually means the column holds strings.

id                      int64
accession_number       object
artist                 object
artistRole             object
artistId                int64
title                  object
dateText               object
medium                 object
creditLine             object
year                   object
acquisitionYear       float64
dimensions             object
width                  object
height                 object
depth                 float64
units                  object
inscription            object
thumbnailCopyright     object
thumbnailUrl           object
url                    object
dtype: object

In [0]:
data.acquisitionYear

0    1922.0
1    1922.0
2    1922.0
3    1922.0
4    1919.0
5    1919.0
6    1919.0
7    1919.0
8    1919.0
9    1919.0
Name: acquisitionYear, dtype: float64

In [0]:
data.acquisitionYear.astype(int)      

0    1922
1    1922
2    1922
3    1922
4    1919
5    1919
6    1919
7    1919
8    1919
9    1919
Name: acquisitionYear, dtype: int64

In [0]:
# The original column is still a float, though. astype() doesn't change the original data; it returns a brand new Series.
data.acquisitionYear

0    1922.0
1    1922.0
2    1922.0
3    1922.0
4    1919.0
5    1919.0
6    1919.0
7    1919.0
8    1919.0
9    1919.0
Name: acquisitionYear, dtype: float64

In [0]:
# To set it permanently

data.acquisitionYear = data.acquisitionYear.astype(int)
data.acquisitionYear       

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value


0    1922
1    1922
2    1922
3    1922
4    1919
5    1919
6    1919
7    1919
8    1919
9    1919
Name: acquisitionYear, dtype: int64
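The warning above appears because `data` was created by slicing another DataFrame, so pandas cannot tell whether the assignment is meant to affect the original. One common way to avoid it, sketched here with a small hypothetical frame (`full` and `subset` are illustrative names), is to take an explicit copy of the slice before modifying it.

```python
import pandas as pd

# Hypothetical stand-in for the full dataset.
full = pd.DataFrame({"acquisitionYear": [1922.0, 1922.0, 1919.0, 1919.0]})

# .copy() makes the subset an independent DataFrame, so assigning to one of
# its columns is unambiguous and leaves `full` untouched.
subset = full[:2].copy()
subset["acquisitionYear"] = subset["acquisitionYear"].astype(int)

print(subset["acquisitionYear"].dtype)  # an integer dtype
print(full["acquisitionYear"].dtype)    # still float64
```

Whether to copy or to write back into the original with `.loc` depends on which object you actually want to change; the copy is the simpler choice when the slice is a throwaway sample.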

In [0]:
# changing back to float

data.acquisitionYear = data.acquisitionYear.astype(float)
data.acquisitionYear

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value


0    1922.0
1    1922.0
2    1922.0
3    1922.0
4    1919.0
5    1919.0
6    1919.0
7    1919.0
8    1919.0
9    1919.0
Name: acquisitionYear, dtype: float64

In [0]:
fullData = pd.read_csv("https://github.com/tategallery/collection/raw/master/artwork_data.csv", low_memory = False)
fullData.head()

Unnamed: 0,id,accession_number,artist,artistRole,artistId,title,dateText,medium,creditLine,year,acquisitionYear,dimensions,width,height,depth,units,inscription,thumbnailCopyright,thumbnailUrl,url
0,1035,A00001,"Blake, Robert",artist,38,A Figure Bowing before a Seated Old Man with h...,date not known,"Watercolour, ink, chalk and graphite on paper....",Presented by Mrs John Richmond 1922,,1922.0,support: 394 x 419 mm,394,419,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-a-fi...
1,1036,A00002,"Blake, Robert",artist,38,"Two Drawings of Frightened Figures, Probably f...",date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 311 x 213 mm,311,213,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-two-...
2,1037,A00003,"Blake, Robert",artist,38,The Preaching of Warning. Verso: An Old Man En...,?c.1785,Graphite on paper. Verso: graphite on paper,Presented by Mrs John Richmond 1922,1785.0,1922.0,support: 343 x 467 mm,343,467,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
3,1038,A00004,"Blake, Robert",artist,38,Six Drawings of Figures with Outstretched Arms,date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 318 x 394 mm,318,394,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-six-...
4,1039,A00005,"Blake, William",artist,39,The Circle of the Lustful: Francesca da Rimini...,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 243 x 335 mm,243,335,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...


In [0]:
# When read_csv() reads the CSV, it has to infer the dtype of each column. When a column contains values it cannot parse as numbers, it falls back to the object dtype.

fullData.dtypes

id                      int64
accession_number       object
artist                 object
artistRole             object
artistId                int64
title                  object
dateText               object
medium                 object
creditLine             object
year                   object
acquisitionYear       float64
dimensions             object
width                  object
height                 object
depth                 float64
units                  object
inscription            object
thumbnailCopyright     object
thumbnailUrl           object
url                    object
dtype: object
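The effect of a single bad value on type inference can be reproduced without downloading anything. The sketch below parses a small in-memory CSV (the contents, including the `(diameter)` entry, are made up for illustration): the clean `id` column is inferred as an integer, while `height` is read as text rather than as a number.

```python
import io
import pandas as pd

# Hypothetical two-column CSV; the last height is text, not a number.
csv_text = "id,height\n1,419\n2,213\n3,(diameter)\n"
frame = pd.read_csv(io.StringIO(csv_text))

print(frame.dtypes)
print(pd.api.types.is_numeric_dtype(frame["height"]))  # False
```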

In [0]:
fullData.height.astype(float)

ValueError: ignored

In [0]:
# This raises an error because at least one element (such as "mm") cannot be parsed as a float. pd.to_numeric() with errors='coerce' converts any value it cannot parse to NaN instead.

pd.to_numeric(fullData.height, errors = 'coerce')

0         419.0
1         213.0
2         467.0
3         394.0
4         335.0
5         338.0
6         334.0
7         340.0
8         335.0
9         340.0
10        340.0
11        150.0
12        151.0
13        153.0
14        152.0
15        152.0
16        153.0
17        153.0
18        150.0
19        152.0
20        152.0
21        152.0
22        151.0
23        151.0
24        150.0
25        151.0
26        150.0
27        151.0
28        150.0
29        150.0
          ...  
69171     760.0
69172     566.0
69173     762.0
69174       NaN
69175       NaN
69176    1679.0
69177       NaN
69178       NaN
69179       NaN
69180       NaN
69181       NaN
69182       NaN
69183       NaN
69184     267.0
69185     470.0
69186     205.0
69187     317.0
69188       NaN
69189      57.0
69190       NaN
69191       NaN
69192    1155.0
69193       NaN
69194     305.0
69195     305.0
69196     305.0
69197     305.0
69198    2410.0
69199       NaN
69200     660.0
Name: height, Length: 69

In [0]:
# making the dtype change permanent

fullData.height = pd.to_numeric(fullData.height, errors = 'coerce')
fullData.height.dtype

dtype('float64')
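Coercing silently turns bad values into NaN, so it can be worth checking which entries were lost. A minimal sketch, using a short hypothetical Series in place of the real `height` column:

```python
import pandas as pd

# Hypothetical raw values: two numbers, a unit string, a missing value, a decimal.
raw = pd.Series(["419", "213", "mm", None, "46.7"])
clean = pd.to_numeric(raw, errors="coerce")

# Entries that were present before but are NaN afterwards failed to parse.
failed = raw[clean.isna() & raw.notna()]

print(clean)
print(failed)
```

Here `failed` contains only the `"mm"` entry; the value that was already missing is not reported as a parse failure.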

# Aggregating Data

NaNs are automatically excluded from aggregate functions. We can even apply aggregate functions to strings, but not every function will work. sum(), for example, concatenates all the strings.
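Both behaviours can be seen on a pair of small, hypothetical Series. The string Series is created with `dtype=object` on purpose, an assumption made here so that the example matches the `object` columns shown in this notebook.

```python
import numpy as np
import pandas as pd

heights = pd.Series([419.0, np.nan, 213.0])
print(heights.mean())              # NaN is skipped: (419 + 213) / 2
print(heights.mean(skipna=False))  # NaN propagates when skipping is turned off

units = pd.Series(["mm", "mm", "cm"], dtype=object)
print(units.min())  # smallest string in lexicographic order
print(units.sum())  # the strings are concatenated
```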

In [0]:
data = fullData[:10]
data

Unnamed: 0,id,accession_number,artist,artistRole,artistId,title,dateText,medium,creditLine,year,acquisitionYear,dimensions,width,height,depth,units,inscription,thumbnailCopyright,thumbnailUrl,url
0,1035,A00001,"Blake, Robert",artist,38,A Figure Bowing before a Seated Old Man with h...,date not known,"Watercolour, ink, chalk and graphite on paper....",Presented by Mrs John Richmond 1922,,1922.0,support: 394 x 419 mm,394,419.0,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-a-fi...
1,1036,A00002,"Blake, Robert",artist,38,"Two Drawings of Frightened Figures, Probably f...",date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 311 x 213 mm,311,213.0,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-two-...
2,1037,A00003,"Blake, Robert",artist,38,The Preaching of Warning. Verso: An Old Man En...,?c.1785,Graphite on paper. Verso: graphite on paper,Presented by Mrs John Richmond 1922,1785.0,1922.0,support: 343 x 467 mm,343,467.0,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
3,1038,A00004,"Blake, Robert",artist,38,Six Drawings of Figures with Outstretched Arms,date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 318 x 394 mm,318,394.0,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-six-...
4,1039,A00005,"Blake, William",artist,39,The Circle of the Lustful: Francesca da Rimini...,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 243 x 335 mm,243,335.0,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
5,1040,A00006,"Blake, William",artist,39,Ciampolo the Barrator Tormented by the Devils,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 240 x 338 mm,240,338.0,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-ciam...
6,1041,A00007,"Blake, William",artist,39,The Baffled Devils Fighting,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 242 x 334 mm,242,334.0,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
7,1042,A00008,"Blake, William",artist,39,The Six-Footed Serpent Attacking Agnolo Brunel...,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 246 x 340 mm,246,340.0,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
8,1043,A00009,"Blake, William",artist,39,The Serpent Attacking Buoso Donati,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 241 x 335 mm,241,335.0,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
9,1044,A00010,"Blake, William",artist,39,The Pit of Disease: The Falsifiers,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 243 x 340 mm,243,340.0,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...


In [0]:
data.acquisitionYear

0    1922.0
1    1922.0
2    1922.0
3    1922.0
4    1919.0
5    1919.0
6    1919.0
7    1919.0
8    1919.0
9    1919.0
Name: acquisitionYear, dtype: float64

In [0]:
data.acquisitionYear.min()

1919.0

In [0]:
data.acquisitionYear.max()

1922.0

In [0]:
data.acquisitionYear.mean()

1920.2

In [0]:
data.acquisitionYear.std()

1.5491933384829668

In [0]:
# agg() applies an aggregate function to every column of the DataFrame.

data.agg('min')

id                                                                 1035
accession_number                                                 A00001
artist                                                    Blake, Robert
artistRole                                                       artist
artistId                                                             38
title                 A Figure Bowing before a Seated Old Man with h...
dateText                                         1826–7, reprinted 1892
medium                                                Graphite on paper
creditLine                          Presented by Mrs John Richmond 1922
acquisitionYear                                                    1919
dimensions                                          image: 240 x 338 mm
width                                                               240
height                                                              213
depth                                                           

In [0]:
# doing multiple aggregations

data.agg(['min', 'max', 'mean', 'std'])

Unnamed: 0,id,accession_number,artist,artistRole,artistId,title,dateText,medium,creditLine,acquisitionYear,dimensions,width,height,depth,units,inscription,thumbnailCopyright,thumbnailUrl,url
min,1035.0,A00001,"Blake, Robert",artist,38.0,A Figure Bowing before a Seated Old Man with h...,"1826–7, reprinted 1892",Graphite on paper,Presented by Mrs John Richmond 1922,1919.0,image: 240 x 338 mm,240.0,213.0,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-a-fi...
max,1044.0,A00010,"Blake, William",artist,39.0,"Two Drawings of Frightened Figures, Probably f...",date not known,"Watercolour, ink, chalk and graphite on paper....",Purchased with the assistance of a special gra...,1922.0,support: 394 x 419 mm,394.0,467.0,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-two-...
mean,1039.5,,,,38.6,,,,,1920.2,,3.9431099999999997e+28,351.5,,,,,,
std,3.02765,,,,0.516398,,,,,1.549193,,,66.818577,,,,,,
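The `width` mean of about 3.94e+28 in the table above does not look like a real average. A plausible reading, inferred from the output rather than stated in the notebook, is that `width` still holds strings, so the sum concatenated them into one very long number before dividing by the row count. Newer pandas releases may raise an error in this situation instead. One way to sidestep the problem is to aggregate only the numeric columns, sketched here on a small hypothetical frame:

```python
import pandas as pd

# Hypothetical mixed frame: `width` holds digits stored as strings.
works = pd.DataFrame({
    "artist": ["Blake, Robert", "Blake, William"],
    "width": ["394", "243"],
    "height": [419.0, 335.0],
})

# Keep only columns with a numeric dtype, then aggregate those.
numeric = works.select_dtypes(include="number")
stats = numeric.agg(["min", "max", "mean", "std"])
print(stats)
```

Converting `width` with `pd.to_numeric()` first would bring it into the numeric set as well.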


# Normalizing Data

The columns we normalize should be floats or ints; otherwise normalization doesn't make sense. Before normalizing, compute the aggregate functions: min, max, mean, std. Normalization means adjusting the values in a column to change their scale. There are many ways to normalize data; two common ones are:

1. Subtract the mean and divide by the standard deviation.
2. Rescale the elements to lie between zero and one.
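Written as formulas, with $x_i$ a value in the column, $\bar{x}$ the column mean and $s$ its standard deviation, the two options above are roughly:

$$
z_i = \frac{x_i - \bar{x}}{s}
\qquad\text{and}\qquad
x'_i = \frac{x_i - \min(x)}{\max(x) - \min(x)}
$$

The symbols $z_i$ and $x'_i$ are notation introduced here for illustration. Note that pandas' `std()` computes the sample standard deviation (`ddof=1`) by default, so $s$ divides by $n - 1$ rather than $n$.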

In [0]:
# This column is a float, so we can normalize it.
data.height.dtype

dtype('float64')

In [0]:
data.height.min()

213.0

In [0]:
data.height.max()

467.0

In [0]:
data.height.mean()

351.5

In [0]:
data.height.std()

66.81857692455162

In [0]:
height = data.height
height

0    419.0
1    213.0
2    467.0
3    394.0
4    335.0
5    338.0
6    334.0
7    340.0
8    335.0
9    340.0
Name: height, dtype: float64

In [0]:
# This is called Standardization in Statistics.

norm = (height - height.mean()) / height.std()
norm

0    1.010198
1   -2.072777
2    1.728561
3    0.636051
4   -0.246937
5   -0.202040
6   -0.261903
7   -0.172108
8   -0.246937
9   -0.172108
Name: height, dtype: float64

In [0]:
# To get the values between zero and one

minmax = (height - height.min())/(height.max() - height.min())
minmax

0    0.811024
1    0.000000
2    1.000000
3    0.712598
4    0.480315
5    0.492126
6    0.476378
7    0.500000
8    0.480315
9    0.500000
Name: height, dtype: float64

In [0]:
minmax.min()

0.0

In [0]:
minmax.max()

1.0

In [0]:
# The min() and max() results confirm that the values are now scaled between zero and one. Now assign the normalized heights back to the DataFrame.
data.height = minmax
data

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value


Unnamed: 0,id,accession_number,artist,artistRole,artistId,title,dateText,medium,creditLine,year,acquisitionYear,dimensions,width,height,depth,units,inscription,thumbnailCopyright,thumbnailUrl,url
0,1035,A00001,"Blake, Robert",artist,38,A Figure Bowing before a Seated Old Man with h...,date not known,"Watercolour, ink, chalk and graphite on paper....",Presented by Mrs John Richmond 1922,,1922.0,support: 394 x 419 mm,394,0.811024,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-a-fi...
1,1036,A00002,"Blake, Robert",artist,38,"Two Drawings of Frightened Figures, Probably f...",date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 311 x 213 mm,311,0.0,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-two-...
2,1037,A00003,"Blake, Robert",artist,38,The Preaching of Warning. Verso: An Old Man En...,?c.1785,Graphite on paper. Verso: graphite on paper,Presented by Mrs John Richmond 1922,1785.0,1922.0,support: 343 x 467 mm,343,1.0,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
3,1038,A00004,"Blake, Robert",artist,38,Six Drawings of Figures with Outstretched Arms,date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 318 x 394 mm,318,0.712598,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-six-...
4,1039,A00005,"Blake, William",artist,39,The Circle of the Lustful: Francesca da Rimini...,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 243 x 335 mm,243,0.480315,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
5,1040,A00006,"Blake, William",artist,39,Ciampolo the Barrator Tormented by the Devils,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 240 x 338 mm,240,0.492126,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-ciam...
6,1041,A00007,"Blake, William",artist,39,The Baffled Devils Fighting,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 242 x 334 mm,242,0.476378,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
7,1042,A00008,"Blake, William",artist,39,The Six-Footed Serpent Attacking Agnolo Brunel...,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 246 x 340 mm,246,0.5,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
8,1043,A00009,"Blake, William",artist,39,The Serpent Attacking Buoso Donati,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 241 x 335 mm,241,0.480315,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
9,1044,A00010,"Blake, William",artist,39,The Pit of Disease: The Falsifiers,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 243 x 340 mm,243,0.5,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...


In [0]:
# Adding a new column to the dataframe

data['standardizedHeights'] = norm
data

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


Unnamed: 0,id,accession_number,artist,artistRole,artistId,title,dateText,medium,creditLine,year,acquisitionYear,dimensions,width,height,depth,units,inscription,thumbnailCopyright,thumbnailUrl,url,standardized_heights,standardizedHeights
0,1035,A00001,"Blake, Robert",artist,38,A Figure Bowing before a Seated Old Man with h...,date not known,"Watercolour, ink, chalk and graphite on paper....",Presented by Mrs John Richmond 1922,,1922.0,support: 394 x 419 mm,394,0.811024,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-a-fi...,1.010198,1.010198
1,1036,A00002,"Blake, Robert",artist,38,"Two Drawings of Frightened Figures, Probably f...",date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 311 x 213 mm,311,0.0,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-two-...,-2.072777,-2.072777
2,1037,A00003,"Blake, Robert",artist,38,The Preaching of Warning. Verso: An Old Man En...,?c.1785,Graphite on paper. Verso: graphite on paper,Presented by Mrs John Richmond 1922,1785.0,1922.0,support: 343 x 467 mm,343,1.0,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...,1.728561,1.728561
3,1038,A00004,"Blake, Robert",artist,38,Six Drawings of Figures with Outstretched Arms,date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 318 x 394 mm,318,0.712598,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-six-...,0.636051,0.636051
4,1039,A00005,"Blake, William",artist,39,The Circle of the Lustful: Francesca da Rimini...,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 243 x 335 mm,243,0.480315,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...,-0.246937,-0.246937
5,1040,A00006,"Blake, William",artist,39,Ciampolo the Barrator Tormented by the Devils,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 240 x 338 mm,240,0.492126,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-ciam...,-0.20204,-0.20204
6,1041,A00007,"Blake, William",artist,39,The Baffled Devils Fighting,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 242 x 334 mm,242,0.476378,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...,-0.261903,-0.261903
7,1042,A00008,"Blake, William",artist,39,The Six-Footed Serpent Attacking Agnolo Brunel...,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 246 x 340 mm,246,0.5,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...,-0.172108,-0.172108
8,1043,A00009,"Blake, William",artist,39,The Serpent Attacking Buoso Donati,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 241 x 335 mm,241,0.480315,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...,-0.246937,-0.246937
9,1044,A00010,"Blake, William",artist,39,The Pit of Disease: The Falsifiers,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 243 x 340 mm,243,0.5,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...,-0.172108,-0.172108


# Transforming Data

The transform() function applies a function to a Series or DataFrame and returns a result with the same length and index as the input.
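A minimal sketch of this on a hypothetical three-element Series (`height_mm` and `height_cm` are illustrative names): the function is applied to the values, and the result keeps the original index.

```python
import pandas as pd

height_mm = pd.Series([419, 213, 467], name="height")

# Convert millimetres to centimetres; the index is preserved.
height_cm = height_mm.transform(lambda x: x / 10)
print(height_cm)
```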

In [0]:
df = pd.read_csv("https://github.com/tategallery/collection/raw/master/artwork_data.csv")
data = df[:10]




In [0]:
data.height

0    419
1    213
2    467
3    394
4    335
5    338
6    334
7    340
8    335
9    340
Name: height, dtype: object

In [0]:
# Transforming height from mm to cm

data.height = data.height.transform(lambda x: x / 10)
data.height

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value


0    41.9
1    21.3
2    46.7
3    39.4
4    33.5
5    33.8
6    33.4
7    34.0
8    33.5
9    34.0
Name: height, dtype: float64

In [0]:
data.groupby('artist')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7ffb9aa90860>

In [0]:
# This counts the number of unique values in each column within each artist's group and repeats that count on every row of the group.
# The 4 works by Robert Blake use 3 different mediums.
# The 6 works by William Blake all use the same medium.

data.groupby('artist').transform('nunique')

Unnamed: 0,id,accession_number,artistRole,artistId,title,dateText,medium,creditLine,year,acquisitionYear,dimensions,width,height,depth,units,inscription,thumbnailCopyright,thumbnailUrl,url
0,4,4,1,1,4,2,3,1,1,1,4,4,4,0,1,0,0,4,4
1,4,4,1,1,4,2,3,1,1,1,4,4,4,0,1,0,0,4,4
2,4,4,1,1,4,2,3,1,1,1,4,4,4,0,1,0,0,4,4
3,4,4,1,1,4,2,3,1,1,1,4,4,4,0,1,0,0,4,4
4,6,6,1,1,6,1,1,1,1,1,6,5,4,0,1,0,0,6,6
5,6,6,1,1,6,1,1,1,1,1,6,5,4,0,1,0,0,6,6
6,6,6,1,1,6,1,1,1,1,1,6,5,4,0,1,0,0,6,6
7,6,6,1,1,6,1,1,1,1,1,6,5,4,0,1,0,0,6,6
8,6,6,1,1,6,1,1,1,1,1,6,5,4,0,1,0,0,6,6
9,6,6,1,1,6,1,1,1,1,1,6,5,4,0,1,0,0,6,6


In [0]:
# mean of the height column per artist, repeated on every row of that artist

data.groupby('artist')['height'].transform('mean')

0    37.325
1    37.325
2    37.325
3    37.325
4    33.700
5    33.700
6    33.700
7    33.700
8    33.700
9    33.700
Name: height, dtype: float64
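The per-artist mean is repeated on every row because `transform()` broadcasts each group's result back to the original rows. The sketch below, on a small made-up frame (the values are assumptions), contrasts this with `agg()`, which returns one value per group.

```python
import pandas as pd

# Hypothetical data with two artists
works = pd.DataFrame({
    "artist": ["Blake, Robert", "Blake, Robert", "Blake, William"],
    "height": [40.0, 20.0, 33.0],
})

grouped = works.groupby("artist")["height"]

per_group = grouped.agg("mean")      # one row per artist
per_row = grouped.transform("mean")  # one row per original row

print(per_group)
print(per_row)
```

This is why the `transform()` result can be stored as a new column of the original DataFrame, while the `agg()` result cannot be assigned directly.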

In [0]:
data.artist

0     Blake, Robert
1     Blake, Robert
2     Blake, Robert
3     Blake, Robert
4    Blake, William
5    Blake, William
6    Blake, William
7    Blake, William
8    Blake, William
9    Blake, William
Name: artist, dtype: object

In [0]:
data['meanHeightOfArtist'] = data.groupby('artist')['height'].transform('mean')
data

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,id,accession_number,artist,artistRole,artistId,title,dateText,medium,creditLine,year,acquisitionYear,dimensions,width,height,depth,units,inscription,thumbnailCopyright,thumbnailUrl,url,meanHeightOfArtist
0,1035,A00001,"Blake, Robert",artist,38,A Figure Bowing before a Seated Old Man with h...,date not known,"Watercolour, ink, chalk and graphite on paper....",Presented by Mrs John Richmond 1922,,1922.0,support: 394 x 419 mm,394,41.9,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-a-fi...,37.325
1,1036,A00002,"Blake, Robert",artist,38,"Two Drawings of Frightened Figures, Probably f...",date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 311 x 213 mm,311,21.3,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-two-...,37.325
2,1037,A00003,"Blake, Robert",artist,38,The Preaching of Warning. Verso: An Old Man En...,?c.1785,Graphite on paper. Verso: graphite on paper,Presented by Mrs John Richmond 1922,1785.0,1922.0,support: 343 x 467 mm,343,46.7,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...,37.325
3,1038,A00004,"Blake, Robert",artist,38,Six Drawings of Figures with Outstretched Arms,date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 318 x 394 mm,318,39.4,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-six-...,37.325
4,1039,A00005,"Blake, William",artist,39,The Circle of the Lustful: Francesca da Rimini...,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 243 x 335 mm,243,33.5,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...,33.7
5,1040,A00006,"Blake, William",artist,39,Ciampolo the Barrator Tormented by the Devils,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 240 x 338 mm,240,33.8,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-ciam...,33.7
6,1041,A00007,"Blake, William",artist,39,The Baffled Devils Fighting,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 242 x 334 mm,242,33.4,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...,33.7
7,1042,A00008,"Blake, William",artist,39,The Six-Footed Serpent Attacking Agnolo Brunel...,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 246 x 340 mm,246,34.0,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...,33.7
8,1043,A00009,"Blake, William",artist,39,The Serpent Attacking Buoso Donati,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 241 x 335 mm,241,33.5,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...,33.7
9,1044,A00010,"Blake, William",artist,39,The Pit of Disease: The Falsifiers,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 243 x 340 mm,243,34.0,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...,33.7


# Filtering Data


In [0]:
import pandas as pd

df = pd.read_csv('https://github.com/tategallery/collection/raw/master/artwork_data.csv', low_memory = False)
df.head()

Unnamed: 0,id,accession_number,artist,artistRole,artistId,title,dateText,medium,creditLine,year,acquisitionYear,dimensions,width,height,depth,units,inscription,thumbnailCopyright,thumbnailUrl,url
0,1035,A00001,"Blake, Robert",artist,38,A Figure Bowing before a Seated Old Man with h...,date not known,"Watercolour, ink, chalk and graphite on paper....",Presented by Mrs John Richmond 1922,,1922.0,support: 394 x 419 mm,394,419,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-a-fi...
1,1036,A00002,"Blake, Robert",artist,38,"Two Drawings of Frightened Figures, Probably f...",date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 311 x 213 mm,311,213,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-two-...
2,1037,A00003,"Blake, Robert",artist,38,The Preaching of Warning. Verso: An Old Man En...,?c.1785,Graphite on paper. Verso: graphite on paper,Presented by Mrs John Richmond 1922,1785.0,1922.0,support: 343 x 467 mm,343,467,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
3,1038,A00004,"Blake, Robert",artist,38,Six Drawings of Figures with Outstretched Arms,date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 318 x 394 mm,318,394,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-six-...
4,1039,A00005,"Blake, William",artist,39,The Circle of the Lustful: Francesca da Rimini...,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 243 x 335 mm,243,335,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...


In [0]:
df.filter(items = ['id', 'artist', 'title']).head()

Unnamed: 0,id,artist,title
0,1035,"Blake, Robert",A Figure Bowing before a Seated Old Man with h...
1,1036,"Blake, Robert","Two Drawings of Frightened Figures, Probably f..."
2,1037,"Blake, Robert",The Preaching of Warning. Verso: An Old Man En...
3,1038,"Blake, Robert",Six Drawings of Figures with Outstretched Arms
4,1039,"Blake, William",The Circle of the Lustful: Francesca da Rimini...


In [0]:
# case sensitive: 'acquisitionYear' is not matched
df.filter(like = 'year').head()

Unnamed: 0,year
0,
1,
2,1785.0
3,
4,1826.0


# Removing and fixing columns with Pandas

We could do any or all of the following:
1. Dropping columns
2. Change case
3. Rename columns

Why fix columns? 

1. Collaboration - naming clarity when datasets are shared
2. Interaction - when you use it in your API
3. Big Data - not to lose data in large systems.

# Dropping Columns

In [0]:
import pandas as pd
df = pd.read_csv("https://github.com/tategallery/collection/raw/master/artwork_data.csv")
data = df[:10]



In [0]:
data[:2]

Unnamed: 0,id,accession_number,artist,artistRole,artistId,title,dateText,medium,creditLine,year,acquisitionYear,dimensions,width,height,depth,units,inscription,thumbnailCopyright,thumbnailUrl,url
0,1035,A00001,"Blake, Robert",artist,38,A Figure Bowing before a Seated Old Man with h...,date not known,"Watercolour, ink, chalk and graphite on paper....",Presented by Mrs John Richmond 1922,,1922.0,support: 394 x 419 mm,394,419,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-a-fi...
1,1036,A00002,"Blake, Robert",artist,38,"Two Drawings of Frightened Figures, Probably f...",date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 311 x 213 mm,311,213,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-two-...


In [0]:
data.drop(0)

Unnamed: 0,id,accession_number,artist,artistRole,artistId,title,dateText,medium,creditLine,year,acquisitionYear,dimensions,width,height,depth,units,inscription,thumbnailCopyright,thumbnailUrl,url
1,1036,A00002,"Blake, Robert",artist,38,"Two Drawings of Frightened Figures, Probably f...",date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 311 x 213 mm,311,213,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-two-...
2,1037,A00003,"Blake, Robert",artist,38,The Preaching of Warning. Verso: An Old Man En...,?c.1785,Graphite on paper. Verso: graphite on paper,Presented by Mrs John Richmond 1922,1785.0,1922.0,support: 343 x 467 mm,343,467,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
3,1038,A00004,"Blake, Robert",artist,38,Six Drawings of Figures with Outstretched Arms,date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 318 x 394 mm,318,394,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-six-...
4,1039,A00005,"Blake, William",artist,39,The Circle of the Lustful: Francesca da Rimini...,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 243 x 335 mm,243,335,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
5,1040,A00006,"Blake, William",artist,39,Ciampolo the Barrator Tormented by the Devils,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 240 x 338 mm,240,338,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-ciam...
6,1041,A00007,"Blake, William",artist,39,The Baffled Devils Fighting,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 242 x 334 mm,242,334,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
7,1042,A00008,"Blake, William",artist,39,The Six-Footed Serpent Attacking Agnolo Brunel...,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 246 x 340 mm,246,340,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
8,1043,A00009,"Blake, William",artist,39,The Serpent Attacking Buoso Donati,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 241 x 335 mm,241,335,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
9,1044,A00010,"Blake, William",artist,39,The Pit of Disease: The Falsifiers,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 243 x 340 mm,243,340,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...


In [0]:
data.drop(columns = ['id'])

Unnamed: 0,accession_number,artist,artistRole,artistId,title,dateText,medium,creditLine,year,acquisitionYear,dimensions,width,height,depth,units,inscription,thumbnailCopyright,thumbnailUrl,url
0,A00001,"Blake, Robert",artist,38,A Figure Bowing before a Seated Old Man with h...,date not known,"Watercolour, ink, chalk and graphite on paper....",Presented by Mrs John Richmond 1922,,1922.0,support: 394 x 419 mm,394,419,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-a-fi...
1,A00002,"Blake, Robert",artist,38,"Two Drawings of Frightened Figures, Probably f...",date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 311 x 213 mm,311,213,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-two-...
2,A00003,"Blake, Robert",artist,38,The Preaching of Warning. Verso: An Old Man En...,?c.1785,Graphite on paper. Verso: graphite on paper,Presented by Mrs John Richmond 1922,1785.0,1922.0,support: 343 x 467 mm,343,467,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
3,A00004,"Blake, Robert",artist,38,Six Drawings of Figures with Outstretched Arms,date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 318 x 394 mm,318,394,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-six-...
4,A00005,"Blake, William",artist,39,The Circle of the Lustful: Francesca da Rimini...,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 243 x 335 mm,243,335,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
5,A00006,"Blake, William",artist,39,Ciampolo the Barrator Tormented by the Devils,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 240 x 338 mm,240,338,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-ciam...
6,A00007,"Blake, William",artist,39,The Baffled Devils Fighting,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 242 x 334 mm,242,334,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
7,A00008,"Blake, William",artist,39,The Six-Footed Serpent Attacking Agnolo Brunel...,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 246 x 340 mm,246,340,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
8,A00009,"Blake, William",artist,39,The Serpent Attacking Buoso Donati,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 241 x 335 mm,241,335,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
9,A00010,"Blake, William",artist,39,The Pit of Disease: The Falsifiers,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 243 x 340 mm,243,340,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...


In [0]:
data.drop(labels = [0, 1, 2])

Unnamed: 0,id,accession_number,artist,artistRole,artistId,title,dateText,medium,creditLine,year,acquisitionYear,dimensions,width,height,depth,units,inscription,thumbnailCopyright,thumbnailUrl,url
3,1038,A00004,"Blake, Robert",artist,38,Six Drawings of Figures with Outstretched Arms,date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 318 x 394 mm,318,394,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-six-...
4,1039,A00005,"Blake, William",artist,39,The Circle of the Lustful: Francesca da Rimini...,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 243 x 335 mm,243,335,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
5,1040,A00006,"Blake, William",artist,39,Ciampolo the Barrator Tormented by the Devils,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 240 x 338 mm,240,338,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-ciam...
6,1041,A00007,"Blake, William",artist,39,The Baffled Devils Fighting,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 242 x 334 mm,242,334,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
7,1042,A00008,"Blake, William",artist,39,The Six-Footed Serpent Attacking Agnolo Brunel...,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 246 x 340 mm,246,340,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
8,1043,A00009,"Blake, William",artist,39,The Serpent Attacking Buoso Donati,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 241 x 335 mm,241,335,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
9,1044,A00010,"Blake, William",artist,39,The Pit of Disease: The Falsifiers,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 243 x 340 mm,243,340,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...


In [0]:
# Everything we dropped is still there because drop() returns a new DataFrame. To make the change permanent, set inplace = True in drop().
data

Unnamed: 0,id,accession_number,artist,artistRole,artistId,title,dateText,medium,creditLine,year,acquisitionYear,dimensions,width,height,depth,units,inscription,thumbnailCopyright,thumbnailUrl,url
0,1035,A00001,"Blake, Robert",artist,38,A Figure Bowing before a Seated Old Man with h...,date not known,"Watercolour, ink, chalk and graphite on paper....",Presented by Mrs John Richmond 1922,,1922.0,support: 394 x 419 mm,394,419,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-a-fi...
1,1036,A00002,"Blake, Robert",artist,38,"Two Drawings of Frightened Figures, Probably f...",date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 311 x 213 mm,311,213,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-two-...
2,1037,A00003,"Blake, Robert",artist,38,The Preaching of Warning. Verso: An Old Man En...,?c.1785,Graphite on paper. Verso: graphite on paper,Presented by Mrs John Richmond 1922,1785.0,1922.0,support: 343 x 467 mm,343,467,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
3,1038,A00004,"Blake, Robert",artist,38,Six Drawings of Figures with Outstretched Arms,date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 318 x 394 mm,318,394,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-six-...
4,1039,A00005,"Blake, William",artist,39,The Circle of the Lustful: Francesca da Rimini...,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 243 x 335 mm,243,335,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
5,1040,A00006,"Blake, William",artist,39,Ciampolo the Barrator Tormented by the Devils,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 240 x 338 mm,240,338,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-ciam...
6,1041,A00007,"Blake, William",artist,39,The Baffled Devils Fighting,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 242 x 334 mm,242,334,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
7,1042,A00008,"Blake, William",artist,39,The Six-Footed Serpent Attacking Agnolo Brunel...,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 246 x 340 mm,246,340,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
8,1043,A00009,"Blake, William",artist,39,The Serpent Attacking Buoso Donati,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 241 x 335 mm,241,335,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
9,1044,A00010,"Blake, William",artist,39,The Pit of Disease: The Falsifiers,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 243 x 340 mm,243,340,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...


In [0]:
data.drop(columns = ['id'], inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)


In [0]:
data

Unnamed: 0,accession_number,artist,artistRole,artistId,title,dateText,medium,creditLine,year,acquisitionYear,dimensions,width,height,depth,units,inscription,thumbnailCopyright,thumbnailUrl,url
0,A00001,"Blake, Robert",artist,38,A Figure Bowing before a Seated Old Man with h...,date not known,"Watercolour, ink, chalk and graphite on paper....",Presented by Mrs John Richmond 1922,,1922.0,support: 394 x 419 mm,394,419,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-a-fi...
1,A00002,"Blake, Robert",artist,38,"Two Drawings of Frightened Figures, Probably f...",date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 311 x 213 mm,311,213,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-two-...
2,A00003,"Blake, Robert",artist,38,The Preaching of Warning. Verso: An Old Man En...,?c.1785,Graphite on paper. Verso: graphite on paper,Presented by Mrs John Richmond 1922,1785.0,1922.0,support: 343 x 467 mm,343,467,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
3,A00004,"Blake, Robert",artist,38,Six Drawings of Figures with Outstretched Arms,date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 318 x 394 mm,318,394,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-six-...
4,A00005,"Blake, William",artist,39,The Circle of the Lustful: Francesca da Rimini...,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 243 x 335 mm,243,335,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
5,A00006,"Blake, William",artist,39,Ciampolo the Barrator Tormented by the Devils,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 240 x 338 mm,240,338,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-ciam...
6,A00007,"Blake, William",artist,39,The Baffled Devils Fighting,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 242 x 334 mm,242,334,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
7,A00008,"Blake, William",artist,39,The Six-Footed Serpent Attacking Agnolo Brunel...,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 246 x 340 mm,246,340,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
8,A00009,"Blake, William",artist,39,The Serpent Attacking Buoso Donati,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 241 x 335 mm,241,335,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
9,A00010,"Blake, William",artist,39,The Pit of Disease: The Falsifiers,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 243 x 340 mm,243,340,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
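An alternative to `inplace = True` is to keep the DataFrame that `drop()` returns. The sketch below uses a small made-up frame (the values are assumptions) to show that the original is untouched and the returned frame carries the change.

```python
import pandas as pd

# Hypothetical two-row stand-in for the artwork data
data = pd.DataFrame({
    "id": [1035, 1036],
    "artist": ["Blake, Robert", "Blake, Robert"],
})

# drop() returns a new DataFrame; `data` itself still has the id column
trimmed = data.drop(columns=["id"])

print(list(data.columns))
print(list(trimmed.columns))
```

Assigning the result to a name also avoids the copy-of-a-slice warning shown above, since nothing is modified in place.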


In [0]:
# To import only the required columns

df = pd.read_csv("https://github.com/tategallery/collection/raw/master/artwork_data.csv", usecols = ['id', 'artist', 'title'])
df.head(2)

Unnamed: 0,id,artist,title
0,1035,"Blake, Robert",A Figure Bowing before a Seated Old Man with h...
1,1036,"Blake, Robert","Two Drawings of Frightened Figures, Probably f..."


# Changing Column Casing

In [0]:
df = pd.read_csv("https://github.com/tategallery/collection/raw/master/artwork_data.csv")
df.head(2)



Unnamed: 0,id,accession_number,artist,artistRole,artistId,title,dateText,medium,creditLine,year,acquisitionYear,dimensions,width,height,depth,units,inscription,thumbnailCopyright,thumbnailUrl,url
0,1035,A00001,"Blake, Robert",artist,38,A Figure Bowing before a Seated Old Man with h...,date not known,"Watercolour, ink, chalk and graphite on paper....",Presented by Mrs John Richmond 1922,,1922.0,support: 394 x 419 mm,394,419,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-a-fi...
1,1036,A00002,"Blake, Robert",artist,38,"Two Drawings of Frightened Figures, Probably f...",date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 311 x 213 mm,311,213,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-two-...


In [0]:
df.columns

Index(['id', 'accession_number', 'artist', 'artistRole', 'artistId', 'title',
       'dateText', 'medium', 'creditLine', 'year', 'acquisitionYear',
       'dimensions', 'width', 'height', 'depth', 'units', 'inscription',
       'thumbnailCopyright', 'thumbnailUrl', 'url'],
      dtype='object')

In [0]:
# The column names mix several capitalization styles, so we normalize them all to lower case

df.columns = df.columns.str.lower()
df.columns

Index(['id', 'accession_number', 'artist', 'artistrole', 'artistid', 'title',
       'datetext', 'medium', 'creditline', 'year', 'acquisitionyear',
       'dimensions', 'width', 'height', 'depth', 'units', 'inscription',
       'thumbnailcopyright', 'thumbnailurl', 'url'],
      dtype='object')

In [0]:
# alternate way

df.columns = [x.upper() for x in df.columns]
df.columns

Index(['ID', 'ACCESSION_NUMBER', 'ARTIST', 'ARTISTROLE', 'ARTISTID', 'TITLE',
       'DATETEXT', 'MEDIUM', 'CREDITLINE', 'YEAR', 'ACQUISITIONYEAR',
       'DIMENSIONS', 'WIDTH', 'HEIGHT', 'DEPTH', 'UNITS', 'INSCRIPTION',
       'THUMBNAILCOPYRIGHT', 'THUMBNAILURL', 'URL'],
      dtype='object')

In [0]:
# yet another alternate way

df.columns = map(lambda x: x.lower(), df.columns)
df.columns

Index(['id', 'accession_number', 'artist', 'artistrole', 'artistid', 'title',
       'datetext', 'medium', 'creditline', 'year', 'acquisitionyear',
       'dimensions', 'width', 'height', 'depth', 'units', 'inscription',
       'thumbnailcopyright', 'thumbnailurl', 'url'],
      dtype='object')
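Lower-casing removes the word boundaries in camelCase names (`artistRole` becomes `artistrole`). If those boundaries are worth keeping, one possible approach is to insert an underscore before each capital letter first. The regular expression here is an illustrative choice, not something the dataset requires.

```python
import pandas as pd

# Hypothetical empty frame that reuses a few of the dataset's column names
df = pd.DataFrame(columns=["id", "accession_number", "artistRole", "thumbnailUrl"])

# Put "_" between a lower-case letter and the capital that follows it,
# then lower-case the whole name
df.columns = (
    df.columns
    .str.replace(r"([a-z])([A-Z])", r"\1_\2", regex=True)
    .str.lower()
)

print(list(df.columns))
```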

# Renaming Columns

In [0]:
# inplace = True applies the change to the original DataFrame
df.rename(columns = {'thumbnailurl': 'thumbnail'}, inplace = True)
df.head(2)

Unnamed: 0,id,accession_number,artist,artistrole,artistid,title,datetext,medium,creditline,year,acquisitionyear,dimensions,width,height,depth,units,inscription,thumbnailcopyright,thumbnail,url
0,1035,A00001,"Blake, Robert",artist,38,A Figure Bowing before a Seated Old Man with h...,date not known,"Watercolour, ink, chalk and graphite on paper....",Presented by Mrs John Richmond 1922,,1922.0,support: 394 x 419 mm,394,419,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-a-fi...
1,1036,A00002,"Blake, Robert",artist,38,"Two Drawings of Frightened Figures, Probably f...",date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 311 x 213 mm,311,213,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-two-...


In [0]:
df.rename(columns = lambda x: x.lower(), inplace = True)

In [0]:
df.columns = ['a', 'b', 'c', 'artistrole', 'artistid', 'title',
       'datetext', 'medium', 'creditline', 'year', 'acquisitionyear',
       'dimensions', 'width', 'height', 'depth', 'units', 'inscription',
       'thumbnailcopyright', 'thumbnailurl', 'url']
df.columns

Index(['a', 'b', 'c', 'artistrole', 'artistid', 'title', 'datetext', 'medium',
       'creditline', 'year', 'acquisitionyear', 'dimensions', 'width',
       'height', 'depth', 'units', 'inscription', 'thumbnailcopyright',
       'thumbnailurl', 'url'],
      dtype='object')

In [0]:
# Changing the names while reading the CSV. With the names parameter we must also pass header = 0, so the file's own header row is replaced instead of being read as data.

df = pd.read_csv("https://github.com/tategallery/collection/raw/master/artwork_data.csv", names = ['id', 'accessionNumber', 'artist', 'artistRole', 'artistId', 'title',
       'dateText', 'medium', 'creditLine', 'year', 'acquisitionYear',
       'dimensions', 'width', 'height', 'depth', 'units', 'inscription',
       'thumbnailCopyRight', 'thumbnailUrl', 'url'], header = 0)
df.columns



Index(['id', 'accessionNumber', 'artist', 'artistRole', 'artistId', 'title',
       'dateText', 'medium', 'creditLine', 'year', 'acquisitionYear',
       'dimensions', 'width', 'height', 'depth', 'units', 'inscription',
       'thumbnailCopyRight', 'thumbnailUrl', 'url'],
      dtype='object')
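One thing to watch with `rename()` is that a key that matches no column is ignored by default, so a typo can pass unnoticed. The sketch below, on a made-up frame, shows the `errors="raise"` option that turns this into a `KeyError`. The misspelled key is deliberate.

```python
import pandas as pd

# Hypothetical frame with two of the lower-cased column names
df = pd.DataFrame({"thumbnailurl": ["x"], "url": ["y"]})

# The key is misspelled (wrong case), so by default nothing is renamed
unchanged = df.rename(columns={"thumbnailURL": "thumbnail"})
print(list(unchanged.columns))

# errors="raise" reports the missing column instead of ignoring it
raised = False
try:
    df.rename(columns={"thumbnailURL": "thumbnail"}, errors="raise")
except KeyError as exc:
    raised = True
    print("rename failed:", exc)
```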

# Indexing and Filtering the dataset

We can eliminate unwanted details and narrow the data down to only the rows and columns we want.

## Direct Filtering with Square brackets

In [0]:
df.head()['id']

0    1035
1    1036
2    1037
3    1038
4    1039
Name: id, dtype: int64

In [0]:
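# A single label inside [] selects a column, so df[0] does not return row 0.
# A slice inside [] does select rows: [:4] returns rows 0 to 3 (the end is exclusive).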
df[['id', 'artist']][:4]

Unnamed: 0,id,artist
0,1035,"Blake, Robert"
1,1036,"Blake, Robert"
2,1037,"Blake, Robert"
3,1038,"Blake, Robert"


In [0]:
# A one-element slice returns a single row

df[['id', 'medium']][1:2]

Unnamed: 0,id,medium
1,1036,Graphite on paper


In [0]:
df['year'].dtype

dtype('O')

# Data Indexing with .loc

`loc[]` takes a row selector and a column selector. To select more than one of either, slice with the `:` operator. Slicing with plain `[]` selects rows by integer position, while `loc[]` selects by label.

`loc[]` slices include the end label: `loc[0:3]` returns rows 0, 1, 2 and 3.
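The difference in end-point handling is easy to check on a small made-up frame (the values are assumptions). The sketch compares `loc[]` with the position-based `iloc[]` and with a plain `[]` slice.

```python
import pandas as pd

# Hypothetical frame with the default integer index 0..4
df = pd.DataFrame({"id": [1035, 1036, 1037, 1038, 1039]})

by_label = df.loc[0:3]      # labels 0, 1, 2 and 3: end label included
by_position = df.iloc[0:3]  # positions 0, 1 and 2: end excluded
by_brackets = df[0:3]       # a plain slice also excludes the end

print(len(by_label))
print(len(by_position))
print(len(by_brackets))
```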

In [0]:
# row 2 and all the columns
df.loc[2,:]

id                                                                 1037
accessionNumber                                                  A00003
artist                                                    Blake, Robert
artistRole                                                       artist
artistId                                                             38
dateText                                                        ?c.1785
medium                      Graphite on paper. Verso: graphite on paper
creditLine                          Presented by Mrs John Richmond 1922
year                                                               1785
acquisitionYear                                                    1922
dimensions                                        support: 343 x 467 mm
width                                                               343
height                                                              467
depth                                                           

In [0]:
# rows 1 to 4 (the end label is included) and all cols
df.loc[1:4, :]

Unnamed: 0,id,accessionNumber,artist,artistRole,artistId,title,dateText,medium,creditLine,year,acquisitionYear,dimensions,width,height,depth,units,inscription,thumbnailCopyRight,thumbnailUrl,url
1,1036,A00002,"Blake, Robert",artist,38,"Two Drawings of Frightened Figures, Probably f...",date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 311 x 213 mm,311,213,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-two-...
2,1037,A00003,"Blake, Robert",artist,38,The Preaching of Warning. Verso: An Old Man En...,?c.1785,Graphite on paper. Verso: graphite on paper,Presented by Mrs John Richmond 1922,1785.0,1922.0,support: 343 x 467 mm,343,467,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
3,1038,A00004,"Blake, Robert",artist,38,Six Drawings of Figures with Outstretched Arms,date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 318 x 394 mm,318,394,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-six-...
4,1039,A00005,"Blake, William",artist,39,The Circle of the Lustful: Francesca da Rimini...,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 243 x 335 mm,243,335,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...


In [0]:
# since loc[] operates on labels, columns are selected by name, not by integer position.
df.loc[0:2, ['artist', 'creditLine']]

Unnamed: 0,artist,creditLine
0,"Blake, Robert",Presented by Mrs John Richmond 1922
1,"Blake, Robert",Presented by Mrs John Richmond 1922
2,"Blake, Robert",Presented by Mrs John Richmond 1922


In [0]:
# only the rows we asked for
df.loc[[33, 55, 66], ['artist', 'medium']]

Unnamed: 0,artist,medium
33,"Blake, William",Relief etching and watercolour on paper
55,British School 18th century,Watercolour on paper
66,"Burne-Jones, Sir Edward Coley, Bt",Graphite on paper


In [0]:
# slice of columns
df.loc[1:3, 'id':'dateText']

Unnamed: 0,id,accessionNumber,artist,artistRole,artistId,title,dateText
1,1036,A00002,"Blake, Robert",artist,38,"Two Drawings of Frightened Figures, Probably f...",date not known
2,1037,A00003,"Blake, Robert",artist,38,The Preaching of Warning. Verso: An Old Man En...,?c.1785
3,1038,A00004,"Blake, Robert",artist,38,Six Drawings of Figures with Outstretched Arms,date not known


In [0]:
# can even do filtering with loc
df.loc[df['artist'] == 'Blake, Robert'].head(2)

Unnamed: 0,id,accessionNumber,artist,artistRole,artistId,title,dateText,medium,creditLine,year,acquisitionYear,dimensions,width,height,depth,units,inscription,thumbnailCopyRight,thumbnailUrl,url
0,1035,A00001,"Blake, Robert",artist,38,A Figure Bowing before a Seated Old Man with h...,date not known,"Watercolour, ink, chalk and graphite on paper....",Presented by Mrs John Richmond 1922,,1922.0,support: 394 x 419 mm,394,419,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-a-fi...
1,1036,A00002,"Blake, Robert",artist,38,"Two Drawings of Frightened Figures, Probably f...",date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 311 x 213 mm,311,213,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-two-...


In [0]:
# alternatively

df.loc[df.artist == 'Blake, Robert'].head(2)

Unnamed: 0,id,accessionNumber,artist,artistRole,artistId,title,dateText,medium,creditLine,year,acquisitionYear,dimensions,width,height,depth,units,inscription,thumbnailCopyRight,thumbnailUrl,url
0,1035,A00001,"Blake, Robert",artist,38,A Figure Bowing before a Seated Old Man with h...,date not known,"Watercolour, ink, chalk and graphite on paper....",Presented by Mrs John Richmond 1922,,1922.0,support: 394 x 419 mm,394,419,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-a-fi...
1,1036,A00002,"Blake, Robert",artist,38,"Two Drawings of Frightened Figures, Probably f...",date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 311 x 213 mm,311,213,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-two-...


In [0]:
# artistRole and medium cols with artist == blake

df.loc[df.artist == 'Blake, Robert', ['artistRole', 'medium']]

Unnamed: 0,artistRole,medium
0,artist,"Watercolour, ink, chalk and graphite on paper...."
1,artist,Graphite on paper
2,artist,Graphite on paper. Verso: graphite on paper
3,artist,Graphite on paper
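
Filters can also be combined. A small sketch on made-up rows (the `toy` frame and its values are assumptions for illustration), putting two conditions together inside loc[]:

```python
import pandas as pd

# Hypothetical rows that mimic the artist and medium columns.
toy = pd.DataFrame({
    "artist": ["Blake, Robert", "Blake, Robert", "Blake, William"],
    "medium": ["Graphite on paper", "Watercolour on paper", "Graphite on paper"],
})

# Each condition needs its own parentheses; use & for AND and | for OR.
mask = (toy["artist"] == "Blake, Robert") & (toy["medium"] == "Graphite on paper")
robert_graphite = toy.loc[mask, ["artist", "medium"]]
print(robert_graphite)
```

Python's `and` / `or` keywords do not work element-wise on a Series, which is why the bitwise operators are used here.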


# iloc to access specific rows and columns

iloc uses integer position instead of the labels.

In [0]:
df.iloc[0:2, 1:3]

Unnamed: 0,accessionNumber,artist
0,A00001,"Blake, Robert"
1,A00002,"Blake, Robert"


In [0]:
df.set_index('id', inplace = True)

In [0]:
df.iloc[0:2, 1:3]

Unnamed: 0_level_0,artist,artistRole
id,Unnamed: 1_level_1,Unnamed: 2_level_1
1035,"Blake, Robert",artist
1036,"Blake, Robert",artist
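
The same iloc call returns different columns before and after set_index because the index column no longer counts as a column position. A small sketch on a made-up frame (`toy` and `indexed` are hypothetical names):

```python
import pandas as pd

# Hypothetical two-row frame with the first few columns of the dataset.
toy = pd.DataFrame({
    "id": [1035, 1036],
    "accessionNumber": ["A00001", "A00002"],
    "artist": ["Blake, Robert", "Blake, Robert"],
    "artistRole": ["artist", "artist"],
})

# Column positions 1 and 2 while 'id' is still a regular column.
before = toy.iloc[0:2, 1:3].columns.tolist()

# After set_index, 'id' leaves the columns and every position shifts by one.
indexed = toy.set_index("id")
after = indexed.iloc[0:2, 1:3].columns.tolist()

print(before)  # ['accessionNumber', 'artist']
print(after)   # ['artist', 'artistRole']
```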


# Filtering data with str.contains

In [0]:
data = df[0:10]

In [0]:
# since it is a Series, we can apply str methods.
type(data.medium)

pandas.core.series.Series

In [0]:
data.medium

id
1035    Watercolour, ink, chalk and graphite on paper....
1036                                    Graphite on paper
1037          Graphite on paper. Verso: graphite on paper
1038                                    Graphite on paper
1039                              Line engraving on paper
1040                              Line engraving on paper
1041                              Line engraving on paper
1042                              Line engraving on paper
1043                              Line engraving on paper
1044                              Line engraving on paper
Name: medium, dtype: object

In [0]:
data.medium.str.contains('Graphite')

id
1035    False
1036     True
1037     True
1038     True
1039    False
1040    False
1041    False
1042    False
1043    False
1044    False
Name: medium, dtype: bool

In [0]:
# returns only the rows where the condition holds true

data.loc[data.medium.str.contains('Graphite')]

Unnamed: 0_level_0,accessionNumber,artist,artistRole,artistId,title,dateText,medium,creditLine,year,acquisitionYear,dimensions,width,height,depth,units,inscription,thumbnailCopyRight,thumbnailUrl,url
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
1036,A00002,"Blake, Robert",artist,38,"Two Drawings of Frightened Figures, Probably f...",date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 311 x 213 mm,311,213,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-two-...
1037,A00003,"Blake, Robert",artist,38,The Preaching of Warning. Verso: An Old Man En...,?c.1785,Graphite on paper. Verso: graphite on paper,Presented by Mrs John Richmond 1922,1785.0,1922.0,support: 343 x 467 mm,343,467,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
1038,A00004,"Blake, Robert",artist,38,Six Drawings of Figures with Outstretched Arms,date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 318 x 394 mm,318,394,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-six-...


In [0]:
# acquisitionYear is not a string, so cast it with astype(str) before using .str methods.

data.acquisitionYear.astype(str).str.contains('1922')

id
1035     True
1036     True
1037     True
1038     True
1039    False
1040    False
1041    False
1042    False
1043    False
1044    False
Name: acquisitionYear, dtype: bool

In [0]:
data.loc[data.acquisitionYear.astype(str).str.contains('1922')]

Unnamed: 0_level_0,accessionNumber,artist,artistRole,artistId,title,dateText,medium,creditLine,year,acquisitionYear,dimensions,width,height,depth,units,inscription,thumbnailCopyRight,thumbnailUrl,url
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
1035,A00001,"Blake, Robert",artist,38,A Figure Bowing before a Seated Old Man with h...,date not known,"Watercolour, ink, chalk and graphite on paper....",Presented by Mrs John Richmond 1922,,1922.0,support: 394 x 419 mm,394,419,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-a-fi...
1036,A00002,"Blake, Robert",artist,38,"Two Drawings of Frightened Figures, Probably f...",date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 311 x 213 mm,311,213,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-two-...
1037,A00003,"Blake, Robert",artist,38,The Preaching of Warning. Verso: An Old Man En...,?c.1785,Graphite on paper. Verso: graphite on paper,Presented by Mrs John Richmond 1922,1785.0,1922.0,support: 343 x 467 mm,343,467,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
1038,A00004,"Blake, Robert",artist,38,Six Drawings of Figures with Outstretched Arms,date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 318 x 394 mm,318,394,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-six-...
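
The first ten rows have no missing medium, but a column with NaNs would put NaN into the mask too. A hedged sketch on a made-up Series, using the `na` and `case` parameters of str.contains:

```python
import numpy as np
import pandas as pd

# Hypothetical medium values, one of them missing.
medium = pd.Series(["Graphite on paper", "Line engraving on paper", np.nan])

# na=False treats missing values as "no match", so the mask stays boolean.
# case=False also matches "Graphite" when we search for lower-case "graphite".
mask = medium.str.contains("graphite", case=False, na=False)
matches = medium[mask]
print(matches)
```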


# Handling Bad, Missing and Duplicate Data

This is the last step in cleaning our data before we feed it into our machine learning models. There are many types of errors we need to fix.

  1. Figure out what counts as bad data: missing data, values that haven't been parsed correctly, or something else entirely.
  2. Define the goal: an API, a machine learning model, or data science.
  
  When you come across bad data, you either drop it, fill it, or replace it with some specific value.
  Large datasets have large data problems.
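
The three options can be sketched on a tiny made-up frame (`toy` and its values are assumptions for illustration, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical rows with a sentinel string and some missing depths.
toy = pd.DataFrame({
    "dateText": ["date not known", "1920", "1962"],
    "depth": [np.nan, 80.0, np.nan],
})

# Replace: turn a sentinel string into a real missing value.
replaced = toy.replace({"dateText": {"date not known": np.nan}})

# Fill: put a default value where data is missing.
filled = toy.fillna({"depth": 0})

# Drop: remove the rows that are missing a value we need.
dropped = toy.dropna(subset=["depth"])
```

None of these calls change `toy` itself. Each returns a new DataFrame unless inplace = True is passed.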
  
# Stripping White Space

In [0]:
import pandas as pd

df = pd.read_csv("https://github.com/tategallery/collection/raw/master/artwork_data.csv", low_memory = False)

In [17]:
# detecting trailing white space (raw string so that \s reaches the regex engine as-is)

df.loc[df.title.str.contains(r'\s$', regex = True)]

# These titles end with at least one white space character.

Unnamed: 0,id,accession_number,artist,artistRole,artistId,title,dateText,medium,creditLine,year,acquisitionYear,dimensions,width,height,depth,units,inscription,thumbnailCopyright,thumbnailUrl,url
49498,4308,P07466,"Hamilton Finlay, Ian",artist,1093,Port-Distinguishing Letters of Scottish Fishin...,1976,Screenprint on ceramic tile,Purchased 1981,1976,1981.0,unconfirmed: 153 x 153 mm,153,153,,mm,,© Estate of Ian Hamilton Finlay,http://www.tate.org.uk/art/images/work/P/P07/P...,http://www.tate.org.uk/art/artworks/hamilton-f...
50534,2235,P11065,"Clarke, Brian",artist,911,Boys,1981,Screenprint on paper,Presented by Paul Beldock 1983,1981,1983.0,image: 1003 x 700 mm,1003,700,,mm,date inscribed,© Brian Clarke. All Rights Reserved 2014 / DACS,http://www.tate.org.uk/art/images/work/P/P11/P...,http://www.tate.org.uk/art/artworks/clarke-boy...
50535,2236,P11066,"Clarke, Brian",artist,911,Buildings,1981,Screenprint on paper,Presented by Paul Beldock 1983,1981,1983.0,image: 1005 x 698 mm,1005,698,,mm,date inscribed,© Brian Clarke. All Rights Reserved 2014 / DACS,http://www.tate.org.uk/art/images/work/P/P11/P...,http://www.tate.org.uk/art/artworks/clarke-bui...
50537,2238,P11068,"Clarke, Brian",artist,911,Pray for Josquin,1981,Screenprint on paper,Presented by Paul Beldock 1983,1981,1983.0,image: 1115 x 688 mm,1115,688,,mm,date inscribed,© Brian Clarke. All Rights Reserved 2014 / DACS,http://www.tate.org.uk/art/images/work/P/P11/P...,http://www.tate.org.uk/art/artworks/clarke-pra...
53186,21168,P77679,"Bourgeois, Louise",artist,2351,Untitled (Safety Pins),1991,Drypoint on paper,Purchased 1994,1991,1994.0,image: 303 x 379 mm,303,379,,mm,,© The Easton Foundation,http://www.tate.org.uk/art/images/work/P/P77/P...,http://www.tate.org.uk/art/artworks/bourgeois-...
56283,5826,T00705,"Hamilton, Richard",artist,1244,Towards a definitive statement on the coming t...,1962,"Oil paint, cellulose paint and printed paper o...",Purchased 1964,1962,1964.0,support: 610 x 813 mm frame: 809 x 1011 x 810 mm,610,813,,mm,date inscribed,© The estate of Richard Hamilton,http://www.tate.org.uk/art/images/work/T/T00/T...,http://www.tate.org.uk/art/artworks/hamilton-t...
67409,88191,T12064,"Leach, David",artist,7651,2 Standard Ware Mead Cups,1945–55,Ochre porcelain,Accepted by HM Government in lieu of inheritan...,1945,2005.0,object: 70 x 80 x 80 mm,70,80,80.0,mm,,© The estate of Bernard Leach,http://www.tate.org.uk/art/images/work/T/T12/T...,http://www.tate.org.uk/art/artworks/leach-2-st...
67432,88215,T12087,"Leach, Bernard",artist,1478,Bowl,c.1960,Porcelain,Accepted by HM Government in lieu of inheritan...,1960,2005.0,object: 50 x 140 x 140 mm,50,140,140.0,mm,,© The estate of Bernard Leach,http://www.tate.org.uk/art/images/work/T/T12/T...,http://www.tate.org.uk/art/artworks/leach-bowl...


In [18]:
df.title = df.title.str.strip()
df.loc[df.title.str.contains(r'\s$', regex = True)]

# No rows are returned: the trailing white space is gone.

Unnamed: 0,id,accession_number,artist,artistRole,artistId,title,dateText,medium,creditLine,year,acquisitionYear,dimensions,width,height,depth,units,inscription,thumbnailCopyright,thumbnailUrl,url


In [0]:
# rstrip() strips the right side of the string, lstrip() strips the left side
df.title = df.title.str.rstrip()
df.title = df.title.str.lstrip()

In [20]:
# or use transform function to strip.

df.title.transform(lambda x: x.strip())

0        A Figure Bowing before a Seated Old Man with h...
1        Two Drawings of Frightened Figures, Probably f...
3           Six Drawings of Figures with Outstretched Arms
4        The Circle of the Lustful: Francesca da Rimini...
5            Ciampolo the Barrator Tormented by the Devils
6                              The Baffled Devils Fighting
7        The Six-Footed Serpent Attacking Agnolo Brunel...
8                       The Serpent Attacking Buoso Donati
9                       The Pit of Disease: The Falsifiers
10                Dante Striking against Bocca Degli Abati
11                                      Job and his Family
12                          Satan before the Throne of God
13           Job’s Sons and Daughters Overwhelmed by Satan
14              The Messengers tell Job of his Misfortunes
15       Satan Going Forth from the Presence of the Lor...
16                       Satan Smiting Job with Sore Boils
17                                        Job’s Comforte
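
One difference between the two approaches shows up when a value is missing. A small sketch on a made-up Series (the title column above may not contain NaNs, so this is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical titles with stray white space and one missing value.
titles = pd.Series(["Boys ", " Bowl", np.nan])

# The .str accessor skips missing values instead of raising.
stripped = titles.str.strip()

# A plain lambda would call .strip() on the float NaN and fail,
# so guard it with a type check if you prefer this style.
stripped_lambda = titles.transform(lambda x: x.strip() if isinstance(x, str) else x)

print(stripped.tolist()[:2])  # ['Boys', 'Bowl']
```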

# Replacing Bad Data with NaN

In [21]:
import numpy as np

df.loc[:, ['dateText']].head(3)

# need to replace "date not known" with np.nan

Unnamed: 0,dateText
0,date not known
1,date not known
2,?c.1785


In [0]:
df.replace({"dateText": {"date not known": np.nan}}, inplace = True)

In [23]:
df.loc[:, ['dateText']].head(3)

# "date not known" is now NaN.

Unnamed: 0,dateText
0,
1,
2,?c.1785


In [24]:
df.loc[df.year.notnull() & df.year.astype(str).str.contains('[^0-9]')].head(3)

# there is "no date" in the year column

Unnamed: 0,id,accession_number,artist,artistRole,artistId,title,dateText,medium,creditLine,year,acquisitionYear,dimensions,width,height,depth,units,inscription,thumbnailCopyright,thumbnailUrl,url
67968,99332,T12629,"Roberts, William",artist,1855,Families on a beach,no date,Graphite on paper,Accepted by HM Government in lieu of inheritan...,no date,2008.0,support: 127 x 178 mm,127.0,178.0,,mm,,© The estate of William Roberts,http://www.tate.org.uk/art/images/work/T/T12/T...,http://www.tate.org.uk/art/artworks/roberts-fa...
67980,99346,T12641,"Roberts, William",artist,1855,Peasants and horseman,no date,Graphite and watercolour on paper,Accepted by HM Government in lieu of inheritan...,no date,2008.0,184 x 140 mm,,,,,,© The estate of William Roberts,http://www.tate.org.uk/art/images/work/T/T12/T...,http://www.tate.org.uk/art/artworks/roberts-pe...
67987,99354,T12648,"Roberts, William",artist,1855,The Beatles,no date,Graphite on paper,Accepted by HM Government in lieu of inheritan...,no date,2008.0,178 x 127 mm,,,,,,© The estate of William Roberts,http://www.tate.org.uk/art/images/work/T/T12/T...,http://www.tate.org.uk/art/artworks/roberts-th...


In [0]:
# always include the column name here; without it, the whole row is set to NaN

df.loc[df.year.notnull() & df.year.astype(str).str.contains('[^0-9]'), ['year']] = np.nan

In [27]:
# year is changed to NaN
df.loc[67968:67969]

Unnamed: 0,id,accession_number,artist,artistRole,artistId,title,dateText,medium,creditLine,year,acquisitionYear,dimensions,width,height,depth,units,inscription,thumbnailCopyright,thumbnailUrl,url
67968,99332,T12629,"Roberts, William",artist,1855,Families on a beach,no date,Graphite on paper,Accepted by HM Government in lieu of inheritan...,,2008.0,support: 127 x 178 mm,127,178,,mm,,© The estate of William Roberts,http://www.tate.org.uk/art/images/work/T/T12/T...,http://www.tate.org.uk/art/artworks/roberts-fa...
67969,99333,T12630,"Roberts, William",artist,1855,The Horsemen,1920,Watercolour and graphite on paper,Accepted by HM Government in lieu of inheritan...,1920.0,2008.0,support: 152 x 184 mm,152,184,,mm,,© The estate of William Roberts,http://www.tate.org.uk/art/images/work/T/T12/T...,http://www.tate.org.uk/art/artworks/roberts-th...
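
As an alternative to the regex mask, pd.to_numeric with errors="coerce" can do the detection and the replacement in one step. This is a swapped-in technique, not what the cells above use, and the Series below is made up for illustration:

```python
import pandas as pd

# Hypothetical year values, two of which are not plain numbers.
year = pd.Series(["1785", "no date", "c.1997", "1920"])

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
year_numeric = pd.to_numeric(year, errors="coerce")
print(year_numeric)
```

The two approaches are not identical: the regex flags any value that contains a non-digit character, while to_numeric accepts anything it can parse as a number, such as "1785.0".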


# Filling missing data with a value

In some cases, it is fitting to replace all the NaNs with some value, usually zero. Many ML models can't process NaNs, so we replace them with something else.

In [32]:
# only fills na in depth column

df.fillna(value={'depth': 0}, inplace = True)
df.depth[67968: 67969]

# 0.0 is filled in the depth column.

67968    0.0
Name: depth, dtype: float64
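
Zero is not the only possible fill value. A hedged sketch on a made-up frame, passing one fill value per column (using the median for width is an assumption for illustration, not something the notes above prescribe):

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with one gap in each column.
toy = pd.DataFrame({
    "depth": [np.nan, 80.0, 140.0],
    "width": [394.0, np.nan, 343.0],
})

# One fill value per column: 0 for depth, the column median for width.
fill_values = {"depth": 0, "width": toy["width"].median()}
filled = toy.fillna(value=fill_values)
print(filled)
```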

# Dropping rows of data

In [33]:
df.shape

(69201, 20)

In [34]:
# thresh = 15 keeps only the rows with at least 15 non-NaN values; rows with fewer are dropped
df.dropna(thresh = 15).shape

(66569, 20)

In [35]:
# drop any row where year OR acquisitionYear is NaN
df.dropna(subset = ['year', 'acquisitionYear']).shape

(63762, 20)

In [36]:
# drop a row only when both year AND acquisitionYear are NaN
df.dropna(subset = ['year', 'acquisitionYear'], how = 'all').shape

(69198, 20)
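
The three dropna variants are easier to compare on a frame small enough to check by eye. A minimal sketch with made-up rows (`toy` is a hypothetical name):

```python
import numpy as np
import pandas as pd

# Row 0 is complete, row 1 misses year, row 2 misses both years.
toy = pd.DataFrame({
    "year": [1785.0, np.nan, np.nan],
    "acquisitionYear": [1922.0, 1922.0, np.nan],
    "title": ["a", "b", "c"],
})

# thresh=2: keep rows that have at least 2 non-NaN values.
at_least_two = toy.dropna(thresh=2)

# subset with the default how="any": drop if either column is NaN.
any_missing = toy.dropna(subset=["year", "acquisitionYear"])

# how="all": drop only if both columns are NaN.
all_missing = toy.dropna(subset=["year", "acquisitionYear"], how="all")

print(list(at_least_two.index))  # [0, 1]
print(list(any_missing.index))   # [0]
print(list(all_missing.index))   # [0, 1]
```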

# Identifying and Dropping Duplicates

Use inplace = True to apply the change to the original dataset. Without it, a new DataFrame is returned and the original is left untouched.

In [37]:
df.shape

(69201, 20)

In [38]:
# drops a row only if every single value in it matches another row
df.drop_duplicates().shape

(69201, 20)

In [39]:
# only compares the artist column, so one row per artist is kept
df.drop_duplicates(subset = ['artist']).shape

(3336, 20)

In [40]:
# keeps the first row
df.drop_duplicates(subset = ['artist'], keep = 'first').shape

(3336, 20)

In [41]:
# drops every row whose artist appears more than once

df.drop_duplicates(subset = ['artist'], keep = False).shape

(1446, 20)
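
To see which rows each keep option retains, here is a small sketch on made-up rows. It also uses duplicated(), which the cells above do not, to flag the repeats before dropping them:

```python
import pandas as pd

# Hypothetical rows: the first artist appears twice.
toy = pd.DataFrame({
    "artist": ["Blake, Robert", "Blake, Robert", "Blake, William"],
    "title": ["A", "B", "C"],
})

# True for the second and later occurrences of an artist.
is_repeat = toy.duplicated(subset=["artist"])

keep_first = toy.drop_duplicates(subset=["artist"], keep="first")
keep_last = toy.drop_duplicates(subset=["artist"], keep="last")
keep_none = toy.drop_duplicates(subset=["artist"], keep=False)

print(keep_first["title"].tolist())  # ['A', 'C']
print(keep_last["title"].tolist())   # ['B', 'C']
print(keep_none["title"].tolist())   # ['C']
```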