# Demo 3.8 - Part 3: Dropping, Renaming, and Cleaning Out Junk

- **Data Cleaning Skills Demonstrated:**
  1. **Dropping** a column
  2. **Renaming** a column
  3. **String columns**:
     1. Remove leading and trailing spaces with *strip()*
  4. **Numeric columns**:
     1. Remove junk characters with *replace()*


In [22]:
import pandas as pd

### Read data from Part 1 Demo Output

In [23]:
df = pd.read_csv("Demo_3.8_P1_SP500_Stocks.csv")

print(df.shape)
df.head()

(503, 8)


,Symbol,Security,GICS Sector,GICS Sub-Industry,Headquarters Location,Date added,CIK,Founded
0,MMM,3M,Industrials,Industrial Conglomerates,"Saint Paul, Minnesota",1957-03-04,66740,1902
1,AOS,A. O. Smith,Industrials,Building Products,"Milwaukee, Wisconsin",2017-07-26,91142,1916
2,ABT,Abbott,Health Care,Health Care Equipment,"North Chicago, Illinois",1957-03-04,1800,1888
3,ABBV,AbbVie,Health Care,Biotechnology,"North Chicago, Illinois",2012-12-31,1551152,2013 (1888)
4,ACN,Accenture,Information Technology,IT Consulting & Other Services,"Dublin, Ireland",2011-07-06,1467373,1989


In [24]:
df.dtypes

Symbol                   object
Security                 object
GICS Sector              object
GICS Sub-Industry        object
Headquarters Location    object
Date added               object
CIK                       int64
Founded                  object
dtype: object

# Clean Data (as needed)

- Drop the CIK column

In [25]:
df.drop('CIK', axis='columns', inplace=True)

print(df.shape)
df.head()

(503, 7)


,Symbol,Security,GICS Sector,GICS Sub-Industry,Headquarters Location,Date added,Founded
0,MMM,3M,Industrials,Industrial Conglomerates,"Saint Paul, Minnesota",1957-03-04,1902
1,AOS,A. O. Smith,Industrials,Building Products,"Milwaukee, Wisconsin",2017-07-26,1916
2,ABT,Abbott,Health Care,Health Care Equipment,"North Chicago, Illinois",1957-03-04,1888
3,ABBV,AbbVie,Health Care,Biotechnology,"North Chicago, Illinois",2012-12-31,2013 (1888)
4,ACN,Accenture,Information Technology,IT Consulting & Other Services,"Dublin, Ireland",2011-07-06,1989
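
The drop above modifies `df` in place. As a minimal sketch of an alternative, the cell below uses the `columns=` keyword, which returns a new DataFrame and leaves the original untouched. The small `demo` frame and the names `trimmed` and `trimmed_again` are stand-ins invented for illustration; they are not part of the demo data.

```python
import pandas as pd

# Small stand-in frame; the values are illustrative, not the full dataset.
demo = pd.DataFrame({
    "Symbol": ["MMM", "AOS"],
    "CIK": [66740, 91142],
    "Founded": ["1902", "1916"],
})

# drop() returns a new DataFrame by default; `demo` itself is unchanged.
trimmed = demo.drop(columns=["CIK"])

# errors="ignore" makes the call safe to re-run once the column is gone.
trimmed_again = trimmed.drop(columns=["CIK"], errors="ignore")

print(list(demo.columns))
print(list(trimmed.columns))
```

Assigning the result to a new name keeps the original frame available, which can help when a notebook cell is run more than once.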


- Rename the `GICS Sector` column to `Sector`

In [26]:
df.rename(columns={'GICS Sector': 'Sector'}, inplace=True)

print(df.shape)
df.head()

(503, 7)


,Symbol,Security,Sector,GICS Sub-Industry,Headquarters Location,Date added,Founded
0,MMM,3M,Industrials,Industrial Conglomerates,"Saint Paul, Minnesota",1957-03-04,1902
1,AOS,A. O. Smith,Industrials,Building Products,"Milwaukee, Wisconsin",2017-07-26,1916
2,ABT,Abbott,Health Care,Health Care Equipment,"North Chicago, Illinois",1957-03-04,1888
3,ABBV,AbbVie,Health Care,Biotechnology,"North Chicago, Illinois",2012-12-31,2013 (1888)
4,ACN,Accenture,Information Technology,IT Consulting & Other Services,"Dublin, Ireland",2011-07-06,1989
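
`rename()` accepts several old-to-new pairs in one dictionary, and by default it silently skips keys that match no column. The sketch below illustrates both points on a stand-in frame. The second rename target (`Sub-Industry`) and the misspelled key are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Stand-in frame with two of the demo's column names.
demo = pd.DataFrame({
    "GICS Sector": ["Industrials"],
    "GICS Sub-Industry": ["Building Products"],
})

# Several columns can be renamed in a single call.
renamed = demo.rename(columns={
    "GICS Sector": "Sector",
    "GICS Sub-Industry": "Sub-Industry",  # hypothetical extra rename
})

# A misspelled key is ignored by default, so nothing changes here.
unchanged = demo.rename(columns={"GICS Sectr": "Sector"})

# errors="raise" turns that silent no-op into a KeyError.
try:
    demo.rename(columns={"GICS Sectr": "Sector"}, errors="raise")
    typo_caught = False
except KeyError:
    typo_caught = True

print(list(renamed.columns))
print(list(unchanged.columns))
print(typo_caught)
```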


In [27]:
df.columns

Index(['Symbol', 'Security', 'Sector', 'GICS Sub-Industry',
       'Headquarters Location', 'Date added', 'Founded'],
      dtype='object')

- Get rid of leading and trailing spaces in the column names with strip()

In [28]:
# Strip leading and trailing spaces from the column names (not the cell values)
df.columns = df.columns.str.strip()

print(df.shape)
df.head()

(503, 7)


,Symbol,Security,Sector,GICS Sub-Industry,Headquarters Location,Date added,Founded
0,MMM,3M,Industrials,Industrial Conglomerates,"Saint Paul, Minnesota",1957-03-04,1902
1,AOS,A. O. Smith,Industrials,Building Products,"Milwaukee, Wisconsin",2017-07-26,1916
2,ABT,Abbott,Health Care,Health Care Equipment,"North Chicago, Illinois",1957-03-04,1888
3,ABBV,AbbVie,Health Care,Biotechnology,"North Chicago, Illinois",2012-12-31,2013 (1888)
4,ACN,Accenture,Information Technology,IT Consulting & Other Services,"Dublin, Ireland",2011-07-06,1989
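
The cell above cleans the column labels only. To strip whitespace from the values inside a string column, the same `.str.strip()` accessor can be applied to each column. The following is a minimal sketch on a stand-in frame; the padded values are made up for illustration and do not come from the demo data.

```python
import pandas as pd

# Stand-in frame with padded column names and padded values.
demo = pd.DataFrame({
    " Symbol ": ["MMM ", " AOS"],
    "Sector": ["  Industrials", "Industrials  "],
})

# Step 1: strip the column names.
demo.columns = demo.columns.str.strip()

# Step 2: strip the values of each string column.
for col in ["Symbol", "Sector"]:
    demo[col] = demo[col].str.strip()

print(list(demo.columns))
print(demo["Symbol"].tolist())
print(demo["Sector"].tolist())
```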


- Introduce some Junk Data

In [29]:
# Introduce some junk data for demonstration
df.loc[0, 'Sector'] = 'Industrials!'
df.loc[1, 'Sector'] = '#Industrials'

# Display the values after the junk characters were added
print("Values in 'Sector' column:")
print(df['Sector'].head())

Values in 'Sector' column:
0              Industrials!
1              #Industrials
2               Health Care
3               Health Care
4    Information Technology
Name: Sector, dtype: object


- Get rid of junk characters with replace()

In [30]:
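# regex=True makes replace() match substrings; without it, only cells whose
# entire value equals '#' or '!' would be replaced.
df['Sector'] = df['Sector'].replace({'#': '', '!': ''}, regex=True)
# Similarly, .replace({'\\$': '', ',': ''}, regex=True) would remove dollar signs and commas.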
print(df.shape)
df.head()


(503, 7)


,Symbol,Security,Sector,GICS Sub-Industry,Headquarters Location,Date added,Founded
0,MMM,3M,Industrials,Industrial Conglomerates,"Saint Paul, Minnesota",1957-03-04,1902
1,AOS,A. O. Smith,Industrials,Building Products,"Milwaukee, Wisconsin",2017-07-26,1916
2,ABT,Abbott,Health Care,Health Care Equipment,"North Chicago, Illinois",1957-03-04,1888
3,ABBV,AbbVie,Health Care,Biotechnology,"North Chicago, Illinois",2012-12-31,2013 (1888)
4,ACN,Accenture,Information Technology,IT Consulting & Other Services,"Dublin, Ireland",2011-07-06,1989
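
The same `replace()` pattern applies to numeric data that was read in as text. The sketch below assumes a hypothetical `Price` column with dollar signs and thousands separators (no such column exists in this dataset). It removes the junk characters and then converts the result with `pd.to_numeric()`.

```python
import pandas as pd

# Hypothetical column: numbers stored as text with '$' and ',' characters.
demo = pd.DataFrame({"Price": ["$1,234.50", "$99", "$12,000"]})

# '$' is a regex metacharacter, so it is escaped; ',' needs no escaping.
cleaned = demo["Price"].replace({r"\$": "", ",": ""}, regex=True)

# Removing the junk still leaves strings; convert them to numbers.
demo["Price"] = pd.to_numeric(cleaned)

print(demo["Price"].tolist())
print(demo["Price"].dtype)
```

Without the final conversion the column would still hold text, so arithmetic and numeric sorting would not behave as expected.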
