Data with a consistent format is often described as "clean." As data scientists, we rarely encounter data that is already clean; we often need to prepare it in a process called **data cleaning**.

We are going to work with data about the art in the Museum of Modern Art (MoMA). MoMA, a museum in New York City, has one of the largest collections of modern art in the world.

# Data dictionary for the MoMA data set

* Title: The title of the artwork.
* Artist: The name of the artist who created the artwork.
* Nationality: The nationality of the artist.
* BeginDate: The year in which the artist was born.
* EndDate: The year in which the artist died.
* Gender: The gender of the artist.
* Date: The date that the artwork was created.
* Department: The department inside MoMA to which the artwork belongs.

In [5]:
from csv import reader

# The with block closes the file once all rows have been read
with open("artworks.csv", encoding="utf-8") as opened_file:
    read_file = reader(opened_file)
    moma = list(read_file)
moma_header = moma[0]
moma = moma[1:]


Often when we're cleaning data, we need to replace parts of strings so our data is consistent.

For example, let's say we have the string "red is my favorite color", but we want to change it to "blue is my favorite color". To do that, we want to replace the "red" part of the string with "blue". When we want to refer to part of a string, we use the term substring.

* Parts of strings are called substrings.
* We can use the str.replace() method to find and replace substrings.
* str.replace() requires two arguments:
  * old: The substring we want to find and replace.
  * new: The substring we want to replace old with.
* When we call str.replace(), we write the name of the variable holding the string we want to modify in place of str.
* str.replace() returns a new string, so we need to use = to assign the modified string to a variable name.

In [3]:
# For learning purposes, a simple example of replacing a substring
age1 = "I am thirty-one years old"

age2 = age1.replace("thirty-one", "thirty-two")
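
To see why the assignment matters, here is a minimal sketch. The variable names and the example sentence are illustrative only. It shows that str.replace() leaves the original string untouched, replaces every occurrence by default, and accepts an optional third argument that limits the number of replacements.

```python
# Illustrative example; the sentence is not taken from the MoMA data.
color = "red is my favorite color; red cars are the best"

# replace() returns a new string; the original is left unchanged.
new_color = color.replace("red", "blue")

# An optional third argument caps the number of replacements.
first_only = color.replace("red", "blue", 1)

print(color)       # red is my favorite color; red cars are the best
print(new_color)   # blue is my favorite color; blue cars are the best
print(first_only)  # blue is my favorite color; red cars are the best
```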


# Cleaning Nationality and Gender

In [3]:
for item in moma:
    nationality = item[2]
    nationality = nationality.replace("(", "")
    nationality = nationality.replace(")", "")
    item[2] = nationality
    
    gender = item[5]
    gender = gender.replace("(", "")
    gender = gender.replace(")", "")
    item[5] = gender
    
   

In [6]:
# Title-case each value and replace empty strings with a default label
for item in moma:
    gender = item[5]
    gender = gender.title()
    if gender == "":
        gender = "Gender Unknown/Other"
    item[5] = gender
    
    nationality = item[2]
    nationality = nationality.title()
    if nationality == "":
        nationality = "Nationality Unknown"
    item[2] = nationality
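
The two halves of the loop above follow the same pattern, so they could be sketched as one small helper. The function name normalize_label and the sample values are hypothetical and are not part of the original notebook.

```python
def normalize_label(value, default):
    """Title-case a label, or return the default when the label is empty."""
    value = value.title()
    if value == "":
        return default
    return value

# Hypothetical sample values, not rows from artworks.csv.
genders = ["female", "Male", ""]
cleaned = [normalize_label(g, "Gender Unknown/Other") for g in genders]
print(cleaned)  # ['Female', 'Male', 'Gender Unknown/Other']
```

The same helper would work for the Nationality column by passing "Nationality Unknown" as the default.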
    


# Cleaning begin and end dates

In [5]:
def clean_and_convert(date):
    if date != "":
        date = date.replace("(","")
        date = date.replace(")","")
        date = int(date)
    return date
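
A quick check of the function on a few hand-written values (assumed examples, not rows from the data set) shows what it returns. The function is repeated here so the snippet runs on its own. Note that an empty string comes back unchanged, so the cleaned column can hold both integers and empty strings.

```python
def clean_and_convert(date):
    # Same logic as the cell above, repeated so this snippet is self-contained.
    if date != "":
        date = date.replace("(", "")
        date = date.replace(")", "")
        date = int(date)
    return date

# Assumed sample values for illustration.
samples = ["(1947)", "(2013)", ""]
converted = [clean_and_convert(value) for value in samples]
print(converted)  # [1947, 2013, '']
```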

In [6]:
for item in moma:
    birth_date = item[3]
    death_date = item[4]
    
    birth_date = clean_and_convert(birth_date)
    death_date = clean_and_convert(death_date)
    
    item[3] = birth_date
    item[4] = death_date
    


# Cleaning Date column

In [7]:
dates = []

for item in moma:
    date = item[6]
    dates.append(date)


In [8]:
test_data = ["1912", "1929", "1913-1923",
             "(1951)", "1994", "1934",
             "c. 1915", "1995", "c. 1912",
             "(1988)", "2002", "1957-1959",
             "c. 1955.", "c. 1970's", 
             "C. 1990-1999"]

bad_chars = ["(",")","c","C",".","s","'", " "]

In [9]:
def strip_characters(string):
    for char in bad_chars:
        string = string.replace(char,"")
    return string
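
The test_data list gives us a convenient way to check the function before running it on the whole data set. The sketch below repeats the definitions so it runs on its own; the name stripped_test_data is introduced here for illustration.

```python
test_data = ["1912", "1929", "1913-1923",
             "(1951)", "1994", "1934",
             "c. 1915", "1995", "c. 1912",
             "(1988)", "2002", "1957-1959",
             "c. 1955.", "c. 1970's",
             "C. 1990-1999"]

bad_chars = ["(", ")", "c", "C", ".", "s", "'", " "]

def strip_characters(string):
    # Remove every occurrence of each unwanted character.
    for char in bad_chars:
        string = string.replace(char, "")
    return string

stripped_test_data = [strip_characters(date) for date in test_data]
print(stripped_test_data)
```

After stripping, every value is either a four-digit year or a range of two years joined by a hyphen.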
        

In [10]:
for item in moma:
    date = item[6]
    date = strip_characters(date)
    item[6] = date



In [11]:
def process_date(string):
    if "-" in string:
        # A range such as "1913-1923": use the rounded midpoint of the two years
        parts = string.split("-")
        first_year = int(parts[0])
        second_year = int(parts[1])
        string = round((first_year + second_year) / 2)
    else:
        # A single year such as "1912"
        string = int(string)
    return string
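
As a quick check, the sketch below runs the function on a few already-stripped strings taken from test_data. The function is repeated in condensed form so the snippet runs on its own.

```python
def process_date(string):
    # Same logic as the cell above, condensed so this snippet is self-contained.
    if "-" in string:
        parts = string.split("-")
        string = round((int(parts[0]) + int(parts[1])) / 2)
    else:
        string = int(string)
    return string

# Already-stripped values from test_data.
stripped = ["1912", "1913-1923", "1957-1959", "1990-1999"]
processed = [process_date(date) for date in stripped]
print(processed)  # [1912, 1918, 1958, 1994]
```

The last value is worth a second look: the midpoint of 1990 and 1999 is 1994.5, and Python 3's round() rounds exact halves to the nearest even integer, so the result is 1994 rather than 1995.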

In [12]:
for item in moma:
    date = item[6]
    date = strip_characters(date)
    date = process_date(date)
    item[6] = date
    


In [13]:
moma[0:3] 

[['Dress MacLeod from Tartan Sets',
  'Sarah Charlesworth',
  'American',
  1947,
  2013,
  'Female',
  1986,
  'Prints & Illustrated Books'],
 ['Duplicate of plate from folio 11 verso (supplementary suite, plate 4) from ARDICIA',
  'Pablo Palazuelo',
  'Spanish',
  1916,
  2007,
  'Male',
  1978,
  'Prints & Illustrated Books'],
 ['Tailpiece (page 55) from SAGESSE',
  'Maurice Denis',
  'French',
  1870,
  1943,
  'Male',
  1900,
  'Prints & Illustrated Books']]

# Writing the cleaned data to a CSV file called artworks_clean.csv

In [14]:
import csv

# The with block closes the file, so every row is flushed to disk before we read it back
with open("artworks_clean.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter=",")
    writer.writerow(moma_header)
    for item in moma:
        writer.writerow(item)

In [15]:
from csv import reader

with open("artworks_clean.csv", encoding="utf-8") as opened_file:
    moma = list(reader(opened_file))
moma[0:3]


[['Title',
  'Artist',
  'Nationality',
  'BeginDate',
  'EndDate',
  'Gender',
  'Date',
  'Department'],
 ['Dress MacLeod from Tartan Sets',
  'Sarah Charlesworth',
  'American',
  '1947',
  '2013',
  'Female',
  '1986',
  'Prints & Illustrated Books'],
 ['Duplicate of plate from folio 11 verso (supplementary suite, plate 4) from ARDICIA',
  'Pablo Palazuelo',
  'Spanish',
  '1916',
  '2007',
  'Male',
  '1978',
  'Prints & Illustrated Books']]
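
Notice that the years come back as strings: a CSV file stores only text, so the integer conversion has to be repeated after loading. The sketch below illustrates this with an in-memory file standing in for artworks_clean.csv. The helper name to_int_or_blank, the shortened header, and the sample row are assumptions made for the example.

```python
import csv
import io

# Simulate the write/read round trip with an in-memory file.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["Title", "BeginDate", "EndDate", "Date"])
writer.writerow(["Untitled", 1947, "", 1986])

buffer.seek(0)
rows = list(csv.reader(buffer))
header, data = rows[0], rows[1:]
print(data)  # [['Untitled', '1947', '', '1986']]

def to_int_or_blank(value):
    # Leave empty strings alone; convert everything else to int.
    return int(value) if value != "" else value

# Convert the numeric columns (indexes 1 to 3 in this reduced example).
for row in data:
    for index in (1, 2, 3):
        row[index] = to_int_or_blank(row[index])
print(data)  # [['Untitled', 1947, '', 1986]]
```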