<a href="https://colab.research.google.com/github/drpetros11111/DSc_ML_for_Business/blob/03_Pandas/4_2_Pandas_3_Data_Cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pandas - Data Cleaning

1. Renaming Columns
2. Re-arranging Column Order
3. Checking data types of specific columns
4. Removing Text from column
5. Deaing with Missing Data
6. Changing Data Types
7. Replacing Text within a column
8. String operations of column data
9. Removing Columns
10. Dropping Rows


In [1]:
# Let's load a new dataset on the number of fires in the Amazon rainforest

import pandas as pd

file_name = "https://raw.githubusercontent.com/rajeevratan84/datascienceforbusiness/master/amazon_fires.csv"
df = pd.read_csv(file_name, encoding = "ISO-8859-1")

df.tail()

Unnamed: 0,ano,mes,estado,numero,encontro
6449,2012,Dezembro,Tocantins,128,1/1/2012
6450,2013,Dezembro,Tocantins,85,1/1/2013
6451,2014,Dezembro,Tocantins,223,1/1/2014
6452,2015,Dezembro,Tocantins,373,1/1/2015
6453,2016,Dezembro,Tocantins,119,1/1/2016


In [2]:
# How many regions are in the dataset?

df['estado'].unique()

array(['Acre', 'alagoas', 'Amapa', 'Amazonas', 'Bahia', 'Ceara',
       'Distrito Federal', 'Espirito Santo', 'Goias', 'Maranhao',
       'Mato Grosso', 'Minas Gerais', 'pará', 'Paraiba', 'Pernambuco',
       'Piau', 'Rio', 'rondonia', 'Roraima', 'Santa Catarina',
       'Sao Paulo', 'Sergipe', 'Tocantins'], dtype=object)

# Renaming Columns

In [3]:
new_columns = {'ano' : 'year',
               'estado': 'state',
               'mes': 'month',
               'numero': 'number_of_fires',
               'encontro': 'date'}

df.rename(columns = new_columns, inplace=True)

In [4]:
df.head()

Unnamed: 0,year,month,state,number_of_fires,date
0,1998,Janeiro,Acre,0 Fires,1/1/1998
1,1999,Janeiro,Acre,0 Fires,1/1/1999
2,2000,Janeiro,Acre,0 Fires,1/1/2000
3,2001,Janeiro,Acre,0 Fires,1/1/2001
4,2002,Janeiro,Acre,0 Fires,1/1/2002


In [5]:
# How many years of data do we have?
df['year'].unique()

array([1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008,
       2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017])

In [6]:
# Let's explore our datetypes, we should expect number_of_types to be an integer or float datatype

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6454 entries, 0 to 6453
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   year             6454 non-null   int64 
 1   month            6454 non-null   object
 2   state            6454 non-null   object
 3   number_of_fires  6322 non-null   object
 4   date             6454 non-null   object
dtypes: int64(1), object(4)
memory usage: 252.2+ KB


In [7]:
df.head()

Unnamed: 0,year,month,state,number_of_fires,date
0,1998,Janeiro,Acre,0 Fires,1/1/1998
1,1999,Janeiro,Acre,0 Fires,1/1/1999
2,2000,Janeiro,Acre,0 Fires,1/1/2000
3,2001,Janeiro,Acre,0 Fires,1/1/2001
4,2002,Janeiro,Acre,0 Fires,1/1/2002


# Re-arranging columns

In [8]:
# Columns are numbered from 0, left to right
# Let's put date first, month second and year 3rd

new_order = [4,1,0,2,3,]
df = df[df.columns[new_order]]
df.head()

Unnamed: 0,date,month,year,state,number_of_fires
0,1/1/1998,Janeiro,1998,Acre,0 Fires
1,1/1/1999,Janeiro,1999,Acre,0 Fires
2,1/1/2000,Janeiro,2000,Acre,0 Fires
3,1/1/2001,Janeiro,2001,Acre,0 Fires
4,1/1/2002,Janeiro,2002,Acre,0 Fires


In [9]:
df.head(25)

Unnamed: 0,date,month,year,state,number_of_fires
0,1/1/1998,Janeiro,1998,Acre,0 Fires
1,1/1/1999,Janeiro,1999,Acre,0 Fires
2,1/1/2000,Janeiro,2000,Acre,0 Fires
3,1/1/2001,Janeiro,2001,Acre,0 Fires
4,1/1/2002,Janeiro,2002,Acre,0 Fires
5,1/1/2003,Janeiro,2003,Acre,10 Fires
6,1/1/2004,Janeiro,2004,Acre,0 Fires
7,1/1/2005,Janeiro,2005,Acre,12 Fires
8,1/1/2006,Janeiro,2006,Acre,4 Fires
9,1/1/2007,Janeiro,2007,Acre,0 Fires


In [10]:
df.tail()

Unnamed: 0,date,month,year,state,number_of_fires
6449,1/1/2012,Dezembro,2012,Tocantins,128
6450,1/1/2013,Dezembro,2013,Tocantins,85
6451,1/1/2014,Dezembro,2014,Tocantins,223
6452,1/1/2015,Dezembro,2015,Tocantins,373
6453,1/1/2016,Dezembro,2016,Tocantins,119


# Determing if a column contains numeric data

In [11]:
# It isn't, let's find our why

df['number_of_fires'].str.isnumeric()

Unnamed: 0,number_of_fires
0,False
1,False
2,False
3,False
4,False
...,...
6449,True
6450,True
6451,True
6452,True


# Determine if the values in a column is numeric
The code snippet you provided aims to determine if the values in the number_of_fires column of the DataFrame df are numeric.

----------------------------
# Explanation:

    df['number_of_fires']
This selects the 'number_of_fires' column from the DataFrame df.

    .str.isnumeric()

This part applies the isnumeric() method to each element of the selected column.

isnumeric() is a string method that checks if all characters in a string are numeric.

It returns True if all characters are numeric, and False otherwise.

------------------------------
Reasoning:

The comment "# It isn't, let's find out why" suggests that the code's author expected the 'number_of_fires' column to contain numeric data but found that it wasn't entirely the case.

They then use str.isnumeric() to investigate the reason for this discrepancy.

In simpler terms:

The code checks each value in the 'number_of_fires' column to see if it is a number.

The comment indicates that the author anticipated numeric data in this column but discovered that it wasn't completely true, leading them to investigate further using str.isnumeric().

In [13]:
# Convert 'number_of_fires' to string type to ensure str.isdigit() works correctly.
# Fill NaN values with empty strings to avoid errors during boolean indexing.
df['number_of_fires'] = df['number_of_fires'].astype(str).fillna('')

# Now, the following line should work without raising the ValueError
df[df['number_of_fires'].str.isdigit()]

Unnamed: 0,date,month,year,state,number_of_fires
239,1/1/1998,Janeiro,1998,alagoas,0
240,1/1/1999,Janeiro,1999,alagoas,58
241,1/1/2000,Janeiro,2000,alagoas,11
242,1/1/2001,Janeiro,2001,alagoas,5
243,1/1/2002,Janeiro,2002,alagoas,12
...,...,...,...,...,...
6449,1/1/2012,Dezembro,2012,Tocantins,128
6450,1/1/2013,Dezembro,2013,Tocantins,85
6451,1/1/2014,Dezembro,2014,Tocantins,223
6452,1/1/2015,Dezembro,2015,Tocantins,373


In [14]:
# We get the above error because our isdigit() returns Nan for blank or missing values
# To fix this we need to convert our column datatype from non-null objects to a String

df['number_of_fires'].astype(str).str.isdigit()

Unnamed: 0,number_of_fires
0,False
1,False
2,False
3,False
4,False
...,...
6449,True
6450,True
6451,True
6452,True



- Bascially, `str.isdigit` only returns True for strings containing solely the digits 0-9.
- By contrast, `str.isnumeric` returns True if it contains any numeric characters. e.g. '½'

# Removing unnecessary text from columns

In [None]:
df['number_of_fires'].str.strip(" Fires")

0         0
1         0
2         0
3         0
4         0
       ... 
6449    128
6450     85
6451    223
6452    373
6453    119
Name: number_of_fires, Length: 6454, dtype: object

    df['number_of_fires']

This part selects the column named 'number_of_fires' from your DataFrame df. This column likely contains information about the number of fires.

##.str:
This is called an "accessor". It allows you to apply string methods to the values within the selected column (Series).

In this case, it's preparing to use a string method on each element of the 'number_of_fires' column.

##.strip(" Fires"):
This is the core of the operation. The .strip() method is a string function that removes leading and trailing characters from a string.

Here, it's specifically instructed to remove the characters " Fires" (including the space) from the beginning and end of each value in the 'number_of_fires' column.

In simpler terms:

This line of code is cleaning up the 'number_of_fires' column by removing the phrase " Fires" if it's present at the beginning or end of any values in the column.

-----------------------------------------

#Example:

Let's say your 'number_of_fires' column has these values:

    '123 Fires'
    '45 Fires'
    '678'
    ' Fires 90'

Use code with caution
After running df['number_of_fires'].str.strip(" Fires"), the column would be modified to:

    '123'
    '45'
    '678'
    '90'

Notice how " Fires" was removed from the first two entries and only " Fires " was removed from the last entry.

This can be useful to standardize the values in your column for further analysis or calculations.

Strip - Return a copy of the string with leading and trailing characters removed. If chars is omitted or None, whitespace characters are removed. If given and not None, chars must be a string; the characters in the string will be stripped from the both ends of the string this method is called on.

In [15]:
# To replace column with cleaned column

df['number_of_fires'] = df['number_of_fires'].str.strip(" Fires")
df.head()

Unnamed: 0,date,month,year,state,number_of_fires
0,1/1/1998,Janeiro,1998,Acre,0
1,1/1/1999,Janeiro,1999,Acre,0
2,1/1/2000,Janeiro,2000,Acre,0
3,1/1/2001,Janeiro,2001,Acre,0
4,1/1/2002,Janeiro,2002,Acre,0


In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6454 entries, 0 to 6453
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   date             6454 non-null   object
 1   month            6454 non-null   object
 2   year             6454 non-null   int64 
 3   state            6454 non-null   object
 4   number_of_fires  6454 non-null   object
dtypes: int64(1), object(4)
memory usage: 252.2+ KB


In [18]:
# We need to convert our number_of_fires column to a float data type
# Replacing any non-digit characters with an empty string

df["number_of_fires"] = pd.to_numeric(df["number_of_fires"].str.replace('[^0-9]', '', regex=True), errors='coerce')
# errors='coerce' will replace any values that cannot be converted to numeric with NaN
df.head()

Unnamed: 0,date,month,year,state,number_of_fires
0,1/1/1998,Janeiro,1998,Acre,0.0
1,1/1/1999,Janeiro,1999,Acre,0.0
2,1/1/2000,Janeiro,2000,Acre,0.0
3,1/1/2001,Janeiro,2001,Acre,0.0
4,1/1/2002,Janeiro,2002,Acre,0.0


# Converting a series to numeric type

##Import pandas:
If not already imported, you'll need to import the pandas library using import pandas as pd.

##Using pd.to_numeric:
We use pd.to_numeric which is a more robust way of converting a series to numeric type.

##Replacing Non-numeric Characters:

We employ str.replace('[^0-9]', '', regex=True) which uses a regular expression to replace any character that is not a digit (0-9) with an empty string.

[]: Square brackets define a character set. This means it will match any single character within the brackets.
^: When the caret symbol (^) is used as the first character inside square brackets, it negates the character set. This means it will match any character that is not in the set.
0-9: This represents a range of characters, specifically all digits from 0 to 9.
Therefore, [^0-9] combined means: "match any character that is not a digit from 0 to 9".

In the context of the code:


    df["number_of_fires"] = pd.to_numeric(df["number_of_fires"].str.replace('[^0-9]', '', regex=True), errors='coerce')

The str.replace('[^0-9]', '', regex=True) part is applied to the 'number_of_fires' column.

It finds all occurrences of any non-digit character ([^0-9]) and replaces them with an empty string ('').

This effectively removes all non-numeric characters from the column, leaving only the digits.

Finally, pd.to_numeric converts the cleaned column to a numeric data type, allowing you to perform numerical operations on it.

-------------------------------
##Example:

If the 'number_of_fires' column contained a value like "123 Fires", the regex [^0-9] would match " Fires" (space, F, i, r, e, s).

These characters would then be replaced with an empty string, resulting in "123", which can be successfully converted to a numeric type.

This ensures that only numbers remain in the column before conversion to float.

##Handling Errors:
errors='coerce' is added to the pd.to_numeric function to handle potential errors during conversion.

This will replace any values that cannot be converted with NaN (Not a Number).

This is important for handling unexpected data entries which can cause the code to fail.

In [19]:
# That was one way we could have handled blank data

# Handling missing data

In [20]:
# Let's reload our dataframe
file_name = "https://raw.githubusercontent.com/rajeevratan84/datascienceforbusiness/master/amazon_fires.csv"
df = pd.read_csv(file_name, encoding = "ISO-8859-1")
new_columns = {'ano' : 'year',
               'estado': 'state',
               'mes': 'month',
               'numero': 'number_of_fires',
               'encontro': 'date'}
df.rename(columns = new_columns, inplace=True)
df['number_of_fires'] = df['number_of_fires'].str.strip(" Fires")
# Creating a true copy of our dataframe
df_copy = df.copy()
df.head()

Unnamed: 0,year,month,state,number_of_fires,date
0,1998,Janeiro,Acre,0,1/1/1998
1,1999,Janeiro,Acre,0,1/1/1999
2,2000,Janeiro,Acre,0,1/1/2000
3,2001,Janeiro,Acre,0,1/1/2001
4,2002,Janeiro,Acre,0,1/1/2002


In [21]:
# Viewing the sum of missing values in each column

df.isnull().sum()

Unnamed: 0,0
year,0
month,0
state,0
number_of_fires,132
date,0


In [22]:
# We can easily remove Null or NaN (not a number) values

# Drop rows with NaN values
df = df.dropna()
df = df.reset_index() # reset's row indexes in case any rows were dropped
df.head()

Unnamed: 0,index,year,month,state,number_of_fires,date
0,0,1998,Janeiro,Acre,0,1/1/1998
1,1,1999,Janeiro,Acre,0,1/1/1999
2,2,2000,Janeiro,Acre,0,1/1/2000
3,3,2001,Janeiro,Acre,0,1/1/2001
4,4,2002,Janeiro,Acre,0,1/1/2002


In [23]:
# Let's check and see it worked

df.isnull().sum()

Unnamed: 0,0
index,0
year,0
month,0
state,0
number_of_fires,0
date,0


In [24]:
# Alright so it worked, now let's reload the data and look at a few other methods of dealing with NaN or Null values

# Let's reload our dataframe
file_name = "https://raw.githubusercontent.com/rajeevratan84/datascienceforbusiness/master/amazon_fires.csv"
df = pd.read_csv(file_name, encoding = "ISO-8859-1")
new_columns = {'ano' : 'year',
               'estado': 'state',
               'mes': 'month',
               'numero': 'number_of_fires',
               'encontro': 'date'}
df.rename(columns = new_columns, inplace=True)
df['number_of_fires'] = df['number_of_fires'].str.strip(" Fires")
df_copy = df.copy()
df.head()

Unnamed: 0,year,month,state,number_of_fires,date
0,1998,Janeiro,Acre,0,1/1/1998
1,1999,Janeiro,Acre,0,1/1/1999
2,2000,Janeiro,Acre,0,1/1/2000
3,2001,Janeiro,Acre,0,1/1/2001
4,2002,Janeiro,Acre,0,1/1/2002


In [25]:
# Create a boolean index for all null values

df['number_of_fires'].isnull()

Unnamed: 0,number_of_fires
0,False
1,False
2,False
3,False
4,False
...,...
6449,False
6450,False
6451,False
6452,False


In [26]:
df[df['number_of_fires'].isnull()].head(10)

Unnamed: 0,year,month,state,number_of_fires,date
68,2006,Abril,Acre,,1/1/2006
110,2008,Junho,Acre,,1/1/2008
127,2005,Julho,Acre,,1/1/2005
206,2004,Novembro,Acre,,1/1/2004
217,2015,Novembro,Acre,,1/1/2015
444,2002,Novembro,alagoas,,1/1/2002
522,2001,Março,Amapa,,1/1/2001
550,2009,Abril,Amapa,,1/1/2009
614,2013,Julho,Amapa,,1/1/2013
642,2001,Setembro,Amapa,,1/1/2001


# What do to with missing data?

* Remove them via .dropna(axis=0)
* Replace them with some arbitary number (e.g. an average)
* Replace them zeros, or Forward Fill (ffill) or Back Fill (backfill)

In [27]:
# Using fillna with zeros

df['number_of_fires'].fillna(0).head()

Unnamed: 0,number_of_fires
0,0
1,0
2,0
3,0
4,0


In [28]:
# Let's try back filling
df['number_of_fires'].fillna(method='ffill').head(70)

  df['number_of_fires'].fillna(method='ffill').head(70)


Unnamed: 0,number_of_fires
0,0
1,0
2,0
3,0
4,0
...,...
65,1
66,2
67,1
68,1


# Fill missing values (NaN) using a technique called forward fill or 'ffill'.



---


Let's examine it step-by-step:

##df['number_of_fires']:

This selects the 'number_of_fires' column of your DataFrame df.

##.fillna(method='ffill'):

This is the core of the operation.
fillna is a pandas function used to fill missing values (represented as NaN) in a Series or DataFrame.

##method='ffill'
specifies the method to use for filling. 'ffill' stands for 'forward fill'.

It propagates the last valid observation forward to the next valid observation.

In simpler terms, it uses the previous non-missing value to fill the current missing value.

##.head(70):
This part is optional and simply displays the first 70 rows of the resulting Series after the forward fill has been applied.

It's mainly for inspection and doesn't modify the DataFrame itself.

------------------
#In summary

The code attempts to fill missing values (NaN) in the 'number_of_fires' column.

It uses the forward fill method ('ffill'), which means it replaces each NaN with the previous non-missing value in the column.

It then shows the first 70 rows of the filled column for you to review.
Example:

Let's say your 'number_of_fires' column looks like this:


    1
    2
    NaN
    4
    NaN
    NaN
    7

After applying fillna(method='ffill'), it would become:

    1
    2
    2  # NaN filled with the previous value (2)
    4
    4  # NaN filled with the previous value (4)
    4  # NaN filled with the previous value (4)
    7

----------------------
#Important Considerations:

Forward fill is a good option when you assume that the missing values are likely similar to the previous observation.

If the first value in the column is NaN, it will remain NaN as there is no previous value to use for filling.

If you want to fill NaNs with a specific value instead of using forward fill, you can provide that value directly to fillna, like fillna(0).

To modify the original DataFrame, you need to assign the result back to the column:

    df['number_of_fires'] = df['number_of_fires'].fillna(method='ffill')


In [None]:
# View index 444 to see how it changes
# Homework, change 444 using ffill and backfill to see how it changes
df.iloc[444]

year                   2002
month              Novembro
state               alagoas
number_of_fires         NaN
date               1/1/2002
Name: 444, dtype: object

In [29]:
df.iloc[445]

Unnamed: 0,445
year,2003
month,Novembro
state,alagoas
number_of_fires,17
date,1/1/2003


# Accesses a specific row using integer-based indexing.

It selects the row with the index label 445.

------------------------
##df:

Refers to your pandas DataFrame.

##.iloc:
This is an attribute of the DataFrame used for integer-location based indexing.

It allows you to select rows and columns using their integer positions.

##[445]

This is the integer index specifying the row you want to access. In this case, it selects the row with the index label 445.
In simpler terms:

##df.iloc[445]

retrieves the entire row of data corresponding to the 445th position in your DataFrame.

-------------------------
##Zero-based indexing:
Pandas uses zero-based indexing, which means the first row has an index of 0, the second row has an index of 1, and so on.

Therefore, df.iloc[445] will retrieve the 446th row of the DataFrame.
Row label vs. position: iloc works based on the position of the row in the DataFrame, not the row labels.

If your DataFrame has custom row labels (e.g., strings or dates), iloc will still use the integer position to select the row.

##Result
The output of df.iloc[445] will be a pandas Series representing the selected row. Each element of the Series will correspond to a column in your DataFrame.
Example:

If your DataFrame df has columns 'A', 'B', and 'C', and you execute df.iloc[445], the output might look like this:

    A    value1
    B    value2
    C    value3

##Name: 445, dtype: object

This shows the values of columns 'A', 'B', and 'C' for the row with index label 445.

---------------
#Alternative

If you want to access a row using its label instead of its position, you can use df.loc[label].

For example, if your rows are labeled with strings, you could use df.loc['row_label'] to access a specific row.

In [30]:
# let's make the assumption that blank values are 0 fires

# let's get back our copy of our original pre-processed datafrmae
df = df_copy

# replace all missing or NaN values with 0
df['number_of_fires'] = df['number_of_fires'].fillna(0)

In [31]:
# Let's check to see if we did change our Nans to 0s
df.iloc[444]

Unnamed: 0,444
year,2002
month,Novembro
state,alagoas
number_of_fires,0
date,1/1/2002


# Assigning data types to our columns

In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6454 entries, 0 to 6453
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   year             6454 non-null   int64 
 1   month            6454 non-null   object
 2   state            6454 non-null   object
 3   number_of_fires  6454 non-null   object
 4   date             6454 non-null   object
dtypes: int64(1), object(4)
memory usage: 252.2+ KB


In [33]:
df["number_of_fires"] = df["number_of_fires"].str.replace('','0').astype(float)

In [34]:
df.head()

Unnamed: 0,year,month,state,number_of_fires,date
0,1998,Janeiro,Acre,0.0,1/1/1998
1,1999,Janeiro,Acre,0.0,1/1/1999
2,2000,Janeiro,Acre,0.0,1/1/2000
3,2001,Janeiro,Acre,0.0,1/1/2001
4,2002,Janeiro,Acre,0.0,1/1/2002


In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6454 entries, 0 to 6453
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   year             6454 non-null   int64  
 1   month            6454 non-null   object 
 2   state            6454 non-null   object 
 3   number_of_fires  6322 non-null   float64
 4   date             6454 non-null   object 
dtypes: float64(1), int64(1), object(3)
memory usage: 252.2+ KB


In [36]:
df['month'].unique()

array(['Janeiro', 'Fevereiro', 'Março', 'Abril', 'Maio', 'Junho', 'Julho',
       'Agosto', 'Setembro', 'Outubro', 'Novembro', 'Dezembro'],
      dtype=object)

# Replacing text in columns

In [37]:
# Let's convert our Portuguese month names to English

month_translations = {'Janeiro': 'January',
'Fevereiro': 'February',
'Março': 'March',
'Abril': 'April',
'Maio': 'May',
'Junho': 'June',
'Julho': 'July',
'Agosto': 'August',
'Setembro': 'September',
'Outubro': 'October',
'Novembro': 'November',
'Dezembro': 'December'}

df["month"] = df["month"].map(month_translations)
df.head()

Unnamed: 0,year,month,state,number_of_fires,date
0,1998,January,Acre,0.0,1/1/1998
1,1999,January,Acre,0.0,1/1/1999
2,2000,January,Acre,0.0,1/1/2000
3,2001,January,Acre,0.0,1/1/2001
4,2002,January,Acre,0.0,1/1/2002


In [38]:
df.isnull().sum()

Unnamed: 0,0
year,0
month,0
state,0
number_of_fires,132
date,0


# Further string functions on columns

In [39]:
df['state'] = df['state'].str.title()
df['state'].unique()

array(['Acre', 'Alagoas', 'Amapa', 'Amazonas', 'Bahia', 'Ceara',
       'Distrito Federal', 'Espirito Santo', 'Goias', 'Maranhao',
       'Mato Grosso', 'Minas Gerais', 'Pará', 'Paraiba', 'Pernambuco',
       'Piau', 'Rio', 'Rondonia', 'Roraima', 'Santa Catarina',
       'Sao Paulo', 'Sergipe', 'Tocantins'], dtype=object)

# Removing columns

In [40]:
df.head()

Unnamed: 0,year,month,state,number_of_fires,date
0,1998,January,Acre,0.0,1/1/1998
1,1999,January,Acre,0.0,1/1/1999
2,2000,January,Acre,0.0,1/1/2000
3,2001,January,Acre,0.0,1/1/2001
4,2002,January,Acre,0.0,1/1/2002


In [41]:
# Dropping multiple columns
df = df.drop("date", axis=1) # axis = 1 so that it works across our columns
df.head()

Unnamed: 0,year,month,state,number_of_fires
0,1998,January,Acre,0.0
1,1999,January,Acre,0.0
2,2000,January,Acre,0.0
3,2001,January,Acre,0.0
4,2002,January,Acre,0.0


In [42]:
# Let's reload the data

# Let's reload our dataframe
file_name = "https://raw.githubusercontent.com/rajeevratan84/datascienceforbusiness/master/amazon_fires.csv"
df = pd.read_csv(file_name, encoding = "ISO-8859-1")
new_columns = {'ano' : 'year',
               'estado': 'state',
               'mes': 'month',
               'numero': 'number_of_fires',
               'encontro': 'date'}
df.rename(columns = new_columns, inplace=True)
df['number_of_fires'] = df['number_of_fires'].str.strip(" Fires")
df_copy = df.copy()
df.head()

Unnamed: 0,year,month,state,number_of_fires,date
0,1998,Janeiro,Acre,0,1/1/1998
1,1999,Janeiro,Acre,0,1/1/1999
2,2000,Janeiro,Acre,0,1/1/2000
3,2001,Janeiro,Acre,0,1/1/2001
4,2002,Janeiro,Acre,0,1/1/2002


In [43]:
# Drop multiple columns
df = df.drop(["year", "date"], axis=1)
df.head()

Unnamed: 0,month,state,number_of_fires
0,Janeiro,Acre,0
1,Janeiro,Acre,0
2,Janeiro,Acre,0
3,Janeiro,Acre,0
4,Janeiro,Acre,0


# Dropping Rows

Using the df.index function

In [44]:
# Let's drop the first row
df = df.drop(df.index[0])
df = df.reset_index()
df.head()

Unnamed: 0,index,month,state,number_of_fires
0,1,Janeiro,Acre,0
1,2,Janeiro,Acre,0
2,3,Janeiro,Acre,0
3,4,Janeiro,Acre,0
4,5,Janeiro,Acre,10


In [45]:
# Drop multiple rows

df = df.drop(df.index[[2,3]])
df.head()

Unnamed: 0,index,month,state,number_of_fires
0,1,Janeiro,Acre,0
1,2,Janeiro,Acre,0
4,5,Janeiro,Acre,10
5,6,Janeiro,Acre,0
6,7,Janeiro,Acre,12


In [46]:
# Drop a range of rows

df = df.drop(df.index[1:4])
df.head()

Unnamed: 0,index,month,state,number_of_fires
0,1,Janeiro,Acre,0
6,7,Janeiro,Acre,12
7,8,Janeiro,Acre,4
8,9,Janeiro,Acre,0
9,10,Janeiro,Acre,0
