
# Challenge 14: Data Cleaning

In [29]:
import pandas as pd

In [30]:
# death_metal.csv, shared through Google Drive
url = 'https://drive.google.com/file/d/11HsCgxJL_PtJ8xxdT5VZbw6e0y0-VKag/view?usp=sharing'
# The file id is the second-to-last path segment of the share link
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
bands = pd.read_csv(path)
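
The link rewriting in the cell above can be wrapped in a small helper. This is a minimal sketch: the function name `drive_download_url` is hypothetical, and it assumes the share link has the `/file/d/<id>/view` shape used here.

```python
def drive_download_url(share_url: str) -> str:
    """Turn a Google Drive 'view' share link into a direct-download link."""
    # Assumption: the file id is the path segment just before 'view?...',
    # i.e. the second-to-last piece when splitting on '/'.
    file_id = share_url.split("/")[-2]
    return "https://drive.google.com/uc?export=download&id=" + file_id


share_url = "https://drive.google.com/file/d/11HsCgxJL_PtJ8xxdT5VZbw6e0y0-VKag/view?usp=sharing"
print(drive_download_url(share_url))
```

The resulting string can be passed to `pd.read_csv` in the same way as `path`.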

In [31]:
bands.head()

Unnamed: 0,name,country,status,formed_in,genre,theme,active
0,**Act of Destruction?>,united states,ac,2005.0,Melodic Death/Thrash Metal,Death| Love| Life| Evil| Darker Tones,2005-present
1,**Nirvana 2002?>,sweden,su,1988.0,Death Metal,Metaphysical Philosophy| Parapsychology,1988-1992
2,**Olemus?>,austria,ac,1993.0,Death/Black/Gothic Metal,Sadness| Life| Death,1993-present
3,**Misanthrope?>,mexico,oh,2010.0,Death Metal,Death| Destruction| War| Decadence,2010-present
4,**Detonator?>,russia,su,1991.0,Technical Death/Thrash Metal,Loneliness philosophy| state of mind of the pe...,1991-2002


In [32]:
bands.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26 entries, 0 to 25
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   name       26 non-null     object 
 1   country    26 non-null     object 
 2   status     26 non-null     object 
 3   formed_in  26 non-null     float64
 4   genre      26 non-null     object 
 5   theme      26 non-null     object 
 6   active     26 non-null     object 
dtypes: float64(1), object(6)
memory usage: 1.5+ KB


### **Exercise 1:** 
Cleaning the 'name' column

In [33]:
bands['name'].head(5)

0    **Act of Destruction?>
1          **Nirvana 2002?>
2                **Olemus?>
3           **Misanthrope?>
4             **Detonator?>
Name: name, dtype: object

In [34]:
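# Raw string so that \W reaches the regex engine unchanged.
# \W matches every non-word character, so spaces inside names are removed too.
bands["name"] = bands["name"].str.replace(r"\W", '', regex=True)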
bands

Unnamed: 0,name,country,status,formed_in,genre,theme,active
0,ActofDestruction,united states,ac,2005.0,Melodic Death/Thrash Metal,Death| Love| Life| Evil| Darker Tones,2005-present
1,Nirvana2002,sweden,su,1988.0,Death Metal,Metaphysical Philosophy| Parapsychology,1988-1992
2,Olemus,austria,ac,1993.0,Death/Black/Gothic Metal,Sadness| Life| Death,1993-present
3,Misanthrope,mexico,oh,2010.0,Death Metal,Death| Destruction| War| Decadence,2010-present
4,Detonator,russia,su,1991.0,Technical Death/Thrash Metal,Loneliness philosophy| state of mind of the pe...,1991-2002
5,BloodAgent,germany,ac,2013.0,Death/Thrash Metal,War| Death| Apocalypse,2013-present
6,Traumagain,italy,ac,2001.0,Brutal Death Metal,Nihilism| Death,2001-present
7,Anaktorian,finland,su,2001.0,Melodic Death Metal,Death| emotions| pain,2001-2007
8,Revolt,italy,cn,2001.0,Death/Thrash Metal,Society| Hate,2001-2006
9,Coldworker,sweden,su,2006.0,Death Metal,Death,2006-2013
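
Removing every non-word character also deletes the spaces inside a name ('Act of Destruction' becomes 'ActofDestruction'). If the spaces should be kept, one possible variant is to anchor the pattern so that only leading and trailing junk is stripped. This is a sketch on a small made-up Series rather than the full dataset; note that it would also strip legitimate punctuation at the end of a name.

```python
import pandas as pd

# Sample values copied from the first rows of the 'name' column
names = pd.Series(["**Act of Destruction?>", "**Nirvana 2002?>", "**Olemus?>"])

# Strip runs of non-word characters only at the start (^) or end ($) of each name,
# so spaces between words are preserved.
cleaned = names.str.replace(r"^\W+|\W+$", "", regex=True)
print(cleaned.tolist())
```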


### **Exercise 2:** 
Cleaning the country column


In [35]:
bands['country'].head(5)

0    united states
1           sweden
2          austria
3           mexico
4           russia
Name: country, dtype: object

In [36]:
bands["country"] = bands["country"].str.upper()
bands.head(5)


Unnamed: 0,name,country,status,formed_in,genre,theme,active
0,ActofDestruction,UNITED STATES,ac,2005.0,Melodic Death/Thrash Metal,Death| Love| Life| Evil| Darker Tones,2005-present
1,Nirvana2002,SWEDEN,su,1988.0,Death Metal,Metaphysical Philosophy| Parapsychology,1988-1992
2,Olemus,AUSTRIA,ac,1993.0,Death/Black/Gothic Metal,Sadness| Life| Death,1993-present
3,Misanthrope,MEXICO,oh,2010.0,Death Metal,Death| Destruction| War| Decadence,2010-present
4,Detonator,RUSSIA,su,1991.0,Technical Death/Thrash Metal,Loneliness philosophy| state of mind of the pe...,1991-2002


### **Exercise 3:**
Cleaning the status column

The status column has some abbreviations instead of the real status.  
Change them in accordance with this:  
* ac = Active  
* su = Split-up  
* cn = Changed name  
* oh = On hold  
* un = Unknown
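
As an illustration of the mapping above, the sketch below uses `Series.map` on a small made-up Series. Unlike `replace`, `map` returns NaN for any code missing from the dictionary, which makes unexpected abbreviations easy to spot. The labels are copied from the list above; the code `"xx"` is invented for the example.

```python
import pandas as pd

# Abbreviation -> full status, as listed in the exercise text
status_labels = {
    "ac": "Active",
    "su": "Split-up",
    "cn": "Changed name",
    "oh": "On hold",
    "un": "Unknown",
}

codes = pd.Series(["ac", "su", "oh", "xx"])  # "xx" is a made-up, unmapped code

labels = codes.map(status_labels)   # unmapped codes become NaN
unmapped = codes[labels.isna()]     # the codes that still need attention

print(labels.tolist())
print(unmapped.tolist())
```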


In [37]:
bands['status'].head(5)

0    ac
1    su
2    ac
3    oh
4    su
Name: status, dtype: object

In [38]:
bands["status"] = bands["status"].replace({'ac': "Active", 'su': "Split-Up", 'cn': "Changed Name", 'oh':  "On Hold", 'un': "Unknown"})
bands.head()

Unnamed: 0,name,country,status,formed_in,genre,theme,active
0,ActofDestruction,UNITED STATES,Active,2005.0,Melodic Death/Thrash Metal,Death| Love| Life| Evil| Darker Tones,2005-present
1,Nirvana2002,SWEDEN,Split-Up,1988.0,Death Metal,Metaphysical Philosophy| Parapsychology,1988-1992
2,Olemus,AUSTRIA,Active,1993.0,Death/Black/Gothic Metal,Sadness| Life| Death,1993-present
3,Misanthrope,MEXICO,On Hold,2010.0,Death Metal,Death| Destruction| War| Decadence,2010-present
4,Detonator,RUSSIA,Split-Up,1991.0,Technical Death/Thrash Metal,Loneliness philosophy| state of mind of the pe...,1991-2002


### **Exercise 4:**
Cleaning the genre column

The `genre` column stores all of a band's genres in a single string, separated by the character `/`.  
1. First, transform each string into a list of strings  
(e.g. 'Avant-garde Black/Death Metal' to \[Avant-garde Black, Death Metal\]).  
2. Then, create a new column 'number_of_genres' that stores the number of genres in each list.  
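
The two steps can be tried on a few sample strings first. This is a minimal sketch on a made-up Series (the first value is the example from the exercise text), not on the full `bands` DataFrame.

```python
import pandas as pd

genres = pd.Series([
    "Avant-garde Black/Death Metal",
    "Death Metal",
    "Death/Black/Gothic Metal",
])

# Step 1: split each string on '/' to get a list of genres per row
genre_lists = genres.str.split("/")

# Step 2: .str.len() also works on lists and returns the number of items in each
number_of_genres = genre_lists.str.len()

print(genre_lists.tolist())
print(number_of_genres.tolist())
```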


In [39]:
bands['genre'].head(5)

0      Melodic Death/Thrash Metal
1                     Death Metal
2        Death/Black/Gothic Metal
3                     Death Metal
4    Technical Death/Thrash Metal
Name: genre, dtype: object

In [40]:
# Split each genre string on '/' into a list of genres
bands["genre"] = bands["genre"].str.split("/")
# .str.len() returns the length of each list
bands["number_of_genres"] = bands["genre"].str.len()
bands.head()


Unnamed: 0,name,country,status,formed_in,genre,theme,active,number_of_genres
0,ActofDestruction,UNITED STATES,Active,2005.0,"[Melodic Death, Thrash Metal]",Death| Love| Life| Evil| Darker Tones,2005-present,2
1,Nirvana2002,SWEDEN,Split-Up,1988.0,[Death Metal],Metaphysical Philosophy| Parapsychology,1988-1992,1
2,Olemus,AUSTRIA,Active,1993.0,"[Death, Black, Gothic Metal]",Sadness| Life| Death,1993-present,3
3,Misanthrope,MEXICO,On Hold,2010.0,[Death Metal],Death| Destruction| War| Decadence,2010-present,1
4,Detonator,RUSSIA,Split-Up,1991.0,"[Technical Death, Thrash Metal]",Loneliness philosophy| state of mind of the pe...,1991-2002,2


### **Exercise 5:** 
Cleaning the active column

The column `active` contains information about the years when a band was active or '?' if the status or year is unknown.  

Create two new columns 
- `active_from`: when the band formed
- `active_to`: when the band broke up

and fill them with the information contained in the column `active`.  
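
One way to handle the '?' and 'present' values is to split on the first hyphen and then convert both parts to numbers, letting anything non-numeric become NaN. This is only a sketch on a made-up Series; whether 'present' should become NaN or stay as text is a design choice the exercise leaves open, and this conversion loses the distinction between 'present' and unknown.

```python
import pandas as pd

active = pd.Series(["2005-present", "1988-1992", "?"])

# n=1 splits on the first '-' only, so there are never more than two columns;
# rows without a hyphen (such as '?') get a missing value in the second column.
parts = active.str.split("-", n=1, expand=True)

# errors="coerce" turns 'present', '?' and missing values into NaN
active_from = pd.to_numeric(parts[0], errors="coerce")
active_to = pd.to_numeric(parts[1], errors="coerce")

print(active_from.tolist())
print(active_to.tolist())
```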



In [41]:
bands['active'].head(5)

0    2005-present
1       1988-1992
2    1993-present
3    2010-present
4       1991-2002
Name: active, dtype: object

In [42]:
# n=1 splits on the first '-' only, so the result always fits into two columns
bands[["active_from", "active_to"]] = bands["active"].str.split("-", n=1, expand=True)
bands.head()



Unnamed: 0,name,country,status,formed_in,genre,theme,active,number_of_genres,active_from,active_to
0,ActofDestruction,UNITED STATES,Active,2005.0,"[Melodic Death, Thrash Metal]",Death| Love| Life| Evil| Darker Tones,2005-present,2,2005,present
1,Nirvana2002,SWEDEN,Split-Up,1988.0,[Death Metal],Metaphysical Philosophy| Parapsychology,1988-1992,1,1988,1992
2,Olemus,AUSTRIA,Active,1993.0,"[Death, Black, Gothic Metal]",Sadness| Life| Death,1993-present,3,1993,present
3,Misanthrope,MEXICO,On Hold,2010.0,[Death Metal],Death| Destruction| War| Decadence,2010-present,1,2010,present
4,Detonator,RUSSIA,Split-Up,1991.0,"[Technical Death, Thrash Metal]",Loneliness philosophy| state of mind of the pe...,1991-2002,2,1991,2002


### **Exercise 6:**
Counting the themes

Count how many times the words Love, Life, and Death appear in the `theme` column.  
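
If overall totals per word are wanted, one possible approach is to split each theme string on '|', flatten the lists, and count exact matches. This is a sketch on three sample rows taken from the column; it counts whole themes only, so a theme like 'Death Metal' would not count as 'Death'.

```python
import pandas as pd

themes = pd.Series([
    "Death| Love| Life| Evil| Darker Tones",
    "Sadness| Life| Death",
    "Death| Destruction| War| Decadence",
])

# One row per individual theme, with surrounding whitespace removed
tokens = themes.str.split("|").explode().str.strip()

# Count every theme, then keep only the three words of interest
# (a word that never occurs gets 0 instead of being dropped)
counts = tokens.value_counts().reindex(["Love", "Life", "Death"], fill_value=0)

print(counts.to_dict())
```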

In [43]:
bands['theme'].head(5)

0                Death| Love| Life| Evil| Darker Tones
1              Metaphysical Philosophy| Parapsychology
2                                 Sadness| Life| Death
3                   Death| Destruction| War| Decadence
4    Loneliness philosophy| state of mind of the pe...
Name: theme, dtype: object

In [46]:
# Per-row occurrences of each word (str.count matches substrings, case-sensitively)
bands["theme_love"] = bands["theme"].str.count("Love")
bands["theme_life"] = bands["theme"].str.count("Life")
bands["theme_death"] = bands["theme"].str.count("Death")
# Per-row total across the three words
bands["total_themes"] = bands["theme_love"] + bands["theme_life"] + bands["theme_death"]
# Keep only the total in the final DataFrame
bands_final = bands.drop(columns=["theme_love", "theme_life", "theme_death"])
bands_final.head()

Unnamed: 0,name,country,status,formed_in,genre,theme,active,number_of_genres,active_from,active_to,total_themes
0,ActofDestruction,UNITED STATES,Active,2005.0,"[Melodic Death, Thrash Metal]",Death| Love| Life| Evil| Darker Tones,2005-present,2,2005,present,3
1,Nirvana2002,SWEDEN,Split-Up,1988.0,[Death Metal],Metaphysical Philosophy| Parapsychology,1988-1992,1,1988,1992,0
2,Olemus,AUSTRIA,Active,1993.0,"[Death, Black, Gothic Metal]",Sadness| Life| Death,1993-present,3,1993,present,2
3,Misanthrope,MEXICO,On Hold,2010.0,[Death Metal],Death| Destruction| War| Decadence,2010-present,1,2010,present,1
4,Detonator,RUSSIA,Split-Up,1991.0,"[Technical Death, Thrash Metal]",Loneliness philosophy| state of mind of the pe...,1991-2002,2,1991,2002,0
