# Introduction 

<div><p>So far, we've learned how to select, assign, and analyze data with pandas using pre-cleaned data. In reality, data is rarely in the format needed to perform analysis. Data scientists commonly spend <a href="https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/" target="_blank">over half their time cleaning data</a>, so knowing how to clean "messy" data is an extremely important skill.</p>
<p>In this mission, we'll learn the basics of data cleaning with pandas as we work with <code>laptops.csv</code>, a CSV file containing information about 1,300 laptop computers. The first five rows of the CSV file are shown below:</p>
<table class="dataframe">
<thead>
<tr>
<th></th>
<th>Manufacturer</th>
<th>Model Name</th>
<th>Category</th>
<th>Screen Size</th>
<th>Screen</th>
<th>CPU</th>
<th>RAM</th>
<th>Storage</th>
<th>GPU</th>
<th>Operating System</th>
<th>Operating System Version</th>
<th>Weight</th>
<th>Price (Euros)</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Apple</td>
<td>MacBook Pro</td>
<td>Ultrabook</td>
<td>13.3"</td>
<td>IPS Panel Retina Display 2560x1600</td>
<td>Intel Core i5 2.3GHz</td>
<td>8GB</td>
<td>128GB SSD</td>
<td>Intel Iris Plus Graphics 640</td>
<td>macOS</td>
<td>NaN</td>
<td>1.37kg</td>
<td>1339,69</td>
</tr>
<tr>
<th>1</th>
<td>Apple</td>
<td>Macbook Air</td>
<td>Ultrabook</td>
<td>13.3"</td>
<td>1440x900</td>
<td>Intel Core i5 1.8GHz</td>
<td>8GB</td>
<td>128GB Flash Storage</td>
<td>Intel HD Graphics 6000</td>
<td>macOS</td>
<td>NaN</td>
<td>1.34kg</td>
<td>898,94</td>
</tr>
<tr>
<th>2</th>
<td>HP</td>
<td>250 G6</td>
<td>Notebook</td>
<td>15.6"</td>
<td>Full HD 1920x1080</td>
<td>Intel Core i5 7200U 2.5GHz</td>
<td>8GB</td>
<td>256GB SSD</td>
<td>Intel HD Graphics 620</td>
<td>No OS</td>
<td>NaN</td>
<td>1.86kg</td>
<td>575,00</td>
</tr>
<tr>
<th>3</th>
<td>Apple</td>
<td>MacBook Pro</td>
<td>Ultrabook</td>
<td>15.4"</td>
<td>IPS Panel Retina Display 2880x1800</td>
<td>Intel Core i7 2.7GHz</td>
<td>16GB</td>
<td>512GB SSD</td>
<td>AMD Radeon Pro 455</td>
<td>macOS</td>
<td>NaN</td>
<td>1.83kg</td>
<td>2537,45</td>
</tr>
<tr>
<th>4</th>
<td>Apple</td>
<td>MacBook Pro</td>
<td>Ultrabook</td>
<td>13.3"</td>
<td>IPS Panel Retina Display 2560x1600</td>
<td>Intel Core i5 3.1GHz</td>
<td>8GB</td>
<td>256GB SSD</td>
<td>Intel Iris Plus Graphics 650</td>
<td>macOS</td>
<td>NaN</td>
<td>1.37kg</td>
<td>1803,60</td>
</tr>
</tbody>
</table>
<p>We can start by reading the data into pandas. Let's look at what happens when we use the <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html" target="_blank"><code>pandas.read_csv()</code> function</a> with only the filename argument:</p>
</div>

```
laptops = pd.read_csv("laptops.csv")
```
```
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._string_convert()

pandas/_libs/parsers.pyx in pandas._libs.parsers._string_box_utf8()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 4: invalid continuation byte
```

<div>
<p>We get an error! (The error message has been shortened.) This error references UTF-8, which is a type of <strong>encoding</strong>. Computers, at their lowest levels, can only understand binary (<code>0</code> and <code>1</code>), and encodings are systems for representing characters in binary.</p>
<p>Something we can do if our file has an unknown encoding is to try the most common encodings:</p>
<ul>
<li>UTF-8</li>
<li>Latin-1 (also known as ISO-8859-1)</li>
<li>Windows-1251</li>
</ul>
<p>The <code>pandas.read_csv()</code> function has an <code>encoding</code> argument we can use to specify an encoding:</p>
</div>

```
df = pd.read_csv("filename.csv", encoding="some_encoding")
```
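
<div>
<p>To make the idea of an encoding more concrete, here is a minimal sketch of how one character maps to different bytes under two encodings. The character <code>é</code> is an assumption chosen for illustration; the source doesn't say which character in <code>laptops.csv</code> caused the error. In Latin-1 it happens to be stored as the single byte <code>0xe9</code>, the byte named in the traceback above.</p>
</div>

```python
# Illustrative only: "é" is an assumed example character, not taken from laptops.csv.
char = "é"

latin1_bytes = char.encode("latin-1")  # a single byte in Latin-1
utf8_bytes = char.encode("utf-8")      # two bytes in UTF-8

print(latin1_bytes)
print(utf8_bytes)

# Decoding the Latin-1 byte as if it were UTF-8 fails with a UnicodeDecodeError,
# the same exception class raised by the read_csv() call above.
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError as err:
    decode_failed = True
    print(err)
```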

<div>
<p>Since the <code>pandas.read_csv()</code> function already tried to read in the file with UTF-8 and failed, we know the file's not encoded with that format. Let's try the next most popular encoding in the exercise.</p></div>
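
<div>
<p>The idea of trying the common encodings in turn can be sketched as a small helper. This is only one possible approach: the function name <code>read_csv_with_fallback</code> and the throwaway file <code>demo.csv</code> are hypothetical and are not part of pandas or the lesson's data.</p>
</div>

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("UTF-8", "Latin-1", "Windows-1251")):
    """Hypothetical helper: return (dataframe, encoding) for the first encoding that decodes."""
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError:
            continue  # this encoding could not decode the file, so try the next one
    raise ValueError("None of the candidate encodings could decode the file")


# Build a tiny Latin-1 encoded CSV so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "w", encoding="latin-1") as f:
        f.write("name,price\nCafé,3\n")
    demo_df, used_encoding = read_csv_with_fallback(demo_path)

print(used_encoding)
print(demo_df)
```

<div>
<p>Keep in mind that Latin-1 assigns a character to every possible byte, so it never raises a decoding error. A successful read tells us the file <em>could</em> be decoded, not that the text is necessarily correct, so it's still worth looking at the resulting values.</p>
</div>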

### Instructions 

<ol>
<li>Import the pandas library</li>
<li>Use the <code>pandas.read_csv()</code> function to read the <code>laptops.csv</code> file into a dataframe <code>laptops</code>.<ul>
<li>Specify the encoding using the string <code>"Latin-1"</code>.</li>
</ul>
</li>
<li>Use the <code>DataFrame.info()</code> method to display information about the <code>laptops</code> dataframe.</li>
</ol>

In [1]:
import pandas as pd 
# Error 
laptops = pd.read_csv("laptops.csv")

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 4: invalid continuation byte

In [3]:
import pandas as pd 
laptops = pd.read_csv("laptops.csv", encoding="Latin-1")
laptops.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Manufacturer              1303 non-null   object
 1   Model Name                1303 non-null   object
 2   Category                  1303 non-null   object
 3   Screen Size               1303 non-null   object
 4   Screen                    1303 non-null   object
 5   CPU                       1303 non-null   object
 6   RAM                       1303 non-null   object
 7    Storage                  1303 non-null   object
 8   GPU                       1303 non-null   object
 9   Operating System          1303 non-null   object
 10  Operating System Version  1133 non-null   object
 11  Weight                    1303 non-null   object
 12  Price (Euros)             1303 non-null   object
dtypes: object(13)
memory usage: 132.5+ KB


# Cleaning column names 

<div><p>Below is the output of the <code>DataFrame.info()</code> method from the previous screen:</p>
</div>

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Manufacturer              1303 non-null   object
 1   Model Name                1303 non-null   object
 2   Category                  1303 non-null   object
 3   Screen Size               1303 non-null   object
 4   Screen                    1303 non-null   object
 5   CPU                       1303 non-null   object
 6   RAM                       1303 non-null   object
 7    Storage                  1303 non-null   object
 8   GPU                       1303 non-null   object
 9   Operating System          1303 non-null   object
 10  Operating System Version  1133 non-null   object
 11  Weight                    1303 non-null   object
 12  Price (Euros)             1303 non-null   object
dtypes: object(13)
memory usage: 132.5+ KB
```

<div>
<p>We can see that every column is represented as the <code>object</code> type, indicating that they are represented by strings, not numbers. Also, one of the columns, <code>Operating System Version</code>, has null values. </p>
<p>The column labels have a variety of upper and lowercase letters, as well as spaces and parentheses, which will make them harder to work with and read. One noticeable issue is that the <code>" Storage"</code> column name has a space in front of it. These quirks with column labels can sometimes be hard to spot, so removing extra whitespace from all column names will save us work in the long run.</p>
<p>We can access the column axis of a dataframe using the <a href="https://pandas.pydata.org/pandas-docs/stable/basics.html#attributes-and-the-raw-ndarray-s" target="_blank"><code>DataFrame.columns</code> attribute</a>. This returns an <code>Index</code> object, an immutable, array-like sequence, with the labels of each column:</p>
</div>

```
print(laptops.columns)
```
```
Index(['Manufacturer', 'Model Name', 'Category', 'Screen Size', 'Screen',
       'CPU', 'RAM', ' Storage', 'GPU', 'Operating System',
       'Operating System Version', 'Weight', 'Price (Euros)'],
      dtype='object')
```

<div>
<p>Not only can we use the attribute to view the column labels, we can also assign new labels to the attribute:</p>
</div>

```
laptops_test = laptops.copy()
laptops_test.columns = ['A', 'B', 'C', 'D', 'E',
                        'F', 'G', 'H', 'I', 'J',
                        'K', 'L', 'M']
print(laptops_test.columns)
```
```
Index(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M'], dtype='object')
```

<div>
<p>Next, let's use the <code>DataFrame.columns</code> attribute to remove whitespace from the column names.</p></div>

### Instructions 

<ol>
<li>Remove any whitespace from the start and end of each column name.<ul>
<li>Create an empty list named <code>new_columns</code>.</li>
<li>Use a for loop to iterate through each column name using the <code>DataFrame.columns</code> attribute. Inside the body of the for loop:<ul>
<li>Use the <a href="https://docs.python.org/3.6/library/stdtypes.html#str.strip" target="_blank"><code>str.strip()</code> method</a> to remove whitespace from the start and end of the string.</li>
<li>Append the updated column name to the <code>new_columns</code> list.</li>
</ul>
</li>
<li>Assign the updated column names to the <code>DataFrame.columns</code> attribute.</li>
</ul>
</li>
</ol>

In [4]:
new_columns = []

for c in laptops.columns:
    new_columns.append(c.strip())

laptops.columns = new_columns

<div><p>In the last exercise, we removed whitespace from the column names. Below is the result:</p>
</div>

```
Index(['Manufacturer', 'Model Name', 'Category', 'Screen Size', 'Screen',
       'CPU', 'RAM', 'Storage', 'GPU', 'Operating System',
       'Operating System Version', 'Weight', 'Price (Euros)'],
      dtype='object')
```
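
<div>
<p>The loop-and-append pattern can also be written more compactly. Below is a minimal sketch on a small, hypothetical <code>demo</code> dataframe (not the laptops data) showing a list comprehension and the <code>str</code> accessor on the column index. Either should give the same labels as the explicit loop.</p>
</div>

```python
import pandas as pd

# Hypothetical dataframe with stray whitespace in two of its labels.
demo = pd.DataFrame(columns=["Manufacturer", " Storage", "Weight "])

# Option 1: a list comprehension instead of an explicit loop with append().
stripped_with_comprehension = [c.strip() for c in demo.columns]

# Option 2: the vectorized .str accessor on the Index object.
stripped_with_accessor = demo.columns.str.strip()

demo.columns = stripped_with_comprehension
print(list(demo.columns))
```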

<div>
<p>However, the column labels still have a variety of upper and lowercase letters, as well as parentheses, which will make them harder to work with and read. Let's finish cleaning our column labels by:</p>
<ul>
<li>Replacing spaces with underscores.</li>
<li>Removing special characters.</li>
<li>Making all labels lowercase.</li>
<li>Shortening any long column names.</li>
</ul>
<p>We can create a function that uses <a href="https://docs.python.org/3/library/stdtypes.html#string-methods" target="_blank">Python string methods</a> to clean our column labels, and then again use a loop to apply that function to each label. Let's look at an example:</p>
</div>

```
def clean_col(col):
    col = col.strip()
    col = col.replace("(","")
    col = col.replace(")","")
    col = col.lower()
    return col

new_columns = []
for c in laptops.columns:
    clean_c = clean_col(c)
    new_columns.append(clean_c)

laptops.columns = new_columns
print(laptops.columns)
```
```
Index(['manufacturer', 'model name', 'category', 'screen size', 'screen',
       'cpu', 'ram', 'storage', 'gpu', 'operating system',
       'operating system version', 'weight', 'price euros'],
      dtype='object')
```

<div>
<p>Our code:</p>
<ul>
<li>Defined a function, which:<ul>
<li>Used the <a href="https://docs.python.org/3.6/library/stdtypes.html#str.strip" target="_blank"><code>str.strip()</code> method</a> to remove whitespace from the start and end of the string.</li>
<li>Used the <a href="https://docs.python.org/3.6/library/stdtypes.html#str.replace" target="_blank"><code>str.replace()</code> method</a> to remove parentheses from the string.</li>
<li>Used the <a href="https://docs.python.org/3.6/library/stdtypes.html#str.lower" target="_blank"><code>str.lower()</code> method</a> to make the string lowercase.</li>
<li>Returned the modified string.</li>
</ul>
</li>
<li>Used a loop to apply the function to each item in the index object and assign it back to the <code>DataFrame.columns</code> attribute.</li>
<li>Printed the new values for the <code>DataFrame.columns</code> attribute.</li>
</ul>
<p>Let's use this technique to clean the column labels in our dataframe, adding a few extra cleaning 'chores' along the way.</p></div>

### Instructions 

<ol>
<li>Define a function, which accepts a string argument, and:<ul>
<li>Removes any whitespace from the start and end of the string.</li>
<li>Replaces the substring <code>Operating System</code> with the abbreviation <code>os</code>.</li>
<li>Replaces all spaces with underscores.</li>
<li>Removes parentheses from the string.</li>
<li>Makes the entire string lowercase.</li>
<li>Returns the modified string.</li>
</ul>
</li>
<li>Use a loop to apply the function to each item in the <code>DataFrame.columns</code> attribute for the <code>laptops</code> dataframe. Assign the result back to the <code>DataFrame.columns</code> attribute.</li>
</ol>

In [5]:
import pandas as pd
laptops = pd.read_csv('laptops.csv', encoding='Latin-1')

def clean_col(col_name):
    col = col_name.strip()
    col = col.replace("Operating System", "os")
    col = col.replace(" ", "_")
    col = col.replace("(", "")
    col = col.replace(")", "")
    col = col.lower()
    
    return col 

new_columns = []
for c in laptops.columns:
    new_columns.append(clean_col(c))

laptops.columns = new_columns
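
<div>
<p>Before applying a cleaning function to a whole dataframe, it can help to try it on a few labels first. The sketch below repeats the function from the cell above and runs it on three labels taken from the original column names; the list name <code>sample_labels</code> is our own.</p>
</div>

```python
def clean_col(col_name):
    col = col_name.strip()                       # drop leading/trailing whitespace
    col = col.replace("Operating System", "os")  # abbreviate before lowercasing
    col = col.replace(" ", "_")
    col = col.replace("(", "")
    col = col.replace(")", "")
    col = col.lower()
    return col


# A few of the original labels, used here as a quick sanity check.
sample_labels = [" Storage", "Operating System Version", "Price (Euros)"]
cleaned_labels = [clean_col(label) for label in sample_labels]
print(cleaned_labels)
```

<div>
<p>Note that the order of the steps matters: the <code>"Operating System"</code> replacement has to run before the spaces are replaced and the string is lowercased, otherwise the substring would no longer match.</p>
</div>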


# Converting string columns to numeric 

<div><p>We observed earlier that all 13 columns have the <code>object</code> dtype, meaning they're stored as strings.  Let's look at the first few rows of some of our columns:</p>
</div>

```
print(laptops.iloc[:5,2:5])
```
```
    category screen_size                              screen
0  Ultrabook       13.3"  IPS Panel Retina Display 2560x1600
1  Ultrabook       13.3"                            1440x900
2   Notebook       15.6"                   Full HD 1920x1080
3  Ultrabook       15.4"  IPS Panel Retina Display 2880x1800
4  Ultrabook       13.3"  IPS Panel Retina Display 2560x1600
```

<div>
<p>Of these three columns, we have three different types of text data:</p>
<ul>
<li><code>category</code>: Purely text data - there are no numeric values.</li>
<li><code>screen_size</code>: Numeric data stored as text data because of the <code>"</code> character.</li>
<li><code>screen</code>: A combination of pure text data with numeric data.</li>
</ul>
<p>Because the values in the <code>screen_size</code> column are stored as text data, we can't sort or compare them numerically. For instance, we couldn't reliably select laptops with screens 15" or larger.</p>
<p>Let's convert the <code>screen_size</code> column to numeric next. Whenever we convert text to numeric data, we can follow this data cleaning workflow:</p>
<p><img src="https://s3.amazonaws.com/dq-content/293/cleaning_workflow.svg" alt="string to numeric cleaning workflow"></p>
<p>The first step is to <strong>explore the data</strong>.  One of the best ways to do this is to use the <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html" target="_blank"><code>Series.unique()</code> method</a> to view all of the unique values in the column:</p>
</div>

```
print(laptops["screen_size"].dtype)
print(laptops["screen_size"].unique())
```
```
object

['13.3"', '15.6"', '15.4"', '14.0"', '12.0"', '11.6"',
 '17.3"', '10.1"', '13.5"', '12.5"', '13.0"', '18.4"',
 '13.9"', '12.3"', '17.0"', '15.0"', '14.1"',
 '11.3"']
```

<div>
<p>Our next step is to <strong>identify patterns and special cases</strong>. We can observe the following:</p>
<ul>
<li>All values in this column follow the same pattern - a series of digit and period characters, followed by a quote character (<code>"</code>). </li>
<li>There are no special cases. Every value matches the same pattern.</li>
<li>We'll need to convert the column to a <code>float</code> dtype, as the <code>int</code> dtype won't be able to store the decimal values.</li>
</ul>
<p>Let's identify any patterns and special cases in the <code>ram</code> column next.</p></div>
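
<div>
<p>When a column has too many unique values to check by eye, the pattern can be tested in code instead. The following is a minimal sketch, assuming the pattern "digits, a period, digits, then a quote character"; the regular expression and the series <code>with_oddity</code> are our own illustrations, not part of the lesson's data.</p>
</div>

```python
import pandas as pd

# A handful of the unique values shown above.
screen_size = pd.Series(['13.3"', '15.6"', '14.0"', '11.6"', '17.3"'])

# Assumed pattern: one or more digits, a period, one or more digits, then a quote.
pattern = r'\d+\.\d+"'

matches = screen_size.str.fullmatch(pattern)
print(matches.all())

# Hypothetical series with one value that breaks the pattern, to show how a
# special case would surface.
with_oddity = pd.Series(['13.3"', '15'])
non_matching = with_oddity[~with_oddity.str.fullmatch(pattern)]
print(non_matching.tolist())
```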

### Instructions 

<ol>
<li>Use the <code>Series.unique()</code> method to identify the unique values in the <code>ram</code> column of the <code>laptops</code> dataframe. Assign the result to <code>unique_ram</code>.</li>
<li>After running your code, use the variable inspector to view the unique values in the <code>ram</code> column and identify any patterns.</li>
</ol>

In [6]:
unique_ram = laptops["ram"].unique()

# Removing non-digit characters

<div><p>In the last exercise, we identified a clear pattern in the <code>ram</code> column - all values are integers and include the characters <code>GB</code> at the end of the string:</p>
</div>

```
['8GB' '16GB' '4GB' '2GB' '12GB' '6GB' '32GB' '24GB' '64GB']
```

<div>
<p>To convert both the <code>ram</code> and <code>screen_size</code> columns to numeric dtypes, we'll have to first <strong>remove the non-digit characters</strong>.</p>
<p><img src="https://s3.amazonaws.com/dq-content/293/cleaning_workflow.svg" alt="string to numeric cleaning workflow"></p>
<p>The pandas library contains dozens of <a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#method-summary" target="_blank">vectorized string methods</a> we can use to manipulate text data, many of which perform the same operations as Python string methods. Most vectorized string methods are available using the <a href="http://pandas.pydata.org/pandas-docs/stable/api.html#string-handling" target="_blank"><code>Series.str</code> accessor</a>, which means we can access them by adding <code>str</code> between the series name and the method name:</p>
<p></p><center><img src="https://s3.amazonaws.com/dq-content/346/Syntax.png" alt="vectorized_string_methods"></center><p></p>
<p>In this case, we can use the <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.replace.html" target="_blank"><code>Series.str.replace()</code> method</a>, which is a vectorized version of the Python <code>str.replace()</code> method we used in the previous screen, to remove all the quote characters from every string in the <code>screen_size</code> column:</p>
</div>

```
laptops["screen_size"] = laptops["screen_size"].str.replace('"','')
print(laptops["screen_size"].unique())
```
```
['13.3', '15.6', '15.4', '14.0', '12.0', '11.6', '17.3',
 '10.1', '13.5', '12.5', '13.0', '18.4', '13.9', '12.3',
 '17.0', '15.0', '14.1', '11.3']
```
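
<div>
<p>One detail worth knowing: <code>Series.str.replace()</code> has a <code>regex</code> parameter, and its default changed between pandas versions. For plain substrings such as <code>"</code> or <code>GB</code> the result is the same either way, but passing <code>regex=False</code> makes the intent explicit. The sketch below uses small made-up series to show where the two modes differ.</p>
</div>

```python
import pandas as pd

screen_size = pd.Series(['13.3"', '15.6"', '14.0"'])

# Treat the pattern as a literal string rather than a regular expression.
cleaned = screen_size.str.replace('"', '', regex=False)
print(cleaned.tolist())

# Parentheses are special characters in regular expressions, so the two modes
# give different results for a pattern like "(Euros)".
labels = pd.Series(["Price (Euros)"])
literal = labels.str.replace("(Euros)", "EUR", regex=False)
as_regex = labels.str.replace("(Euros)", "EUR", regex=True)
print(literal.tolist(), as_regex.tolist())
```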

<div>
<p>Let's remove the non-digit characters from the <code>ram</code> column next.</p></div>

### Instructions 

<ol>
<li>Use the <code>Series.str.replace()</code> method to remove the substring <code>GB</code> from the <code>ram</code> column.</li>
<li>Use the <code>Series.unique()</code> method to assign the unique values in the <code>ram</code> column to <code>unique_ram</code>.</li>
<li>After running your code, use the variable inspector to verify your changes.</li>
</ol>

In [9]:
laptops["screen_size"] = laptops["screen_size"].str.replace('"','')

laptops["ram"] = laptops["ram"].str.replace("GB", "")
unique_ram = laptops["ram"].unique()

# Converting columns to numeric dtypes

<div><p>In the last screen, we used the <code>Series.str.replace()</code> method to remove the non-digit characters from the <code>screen_size</code> and <code>ram</code> columns. Now, we can <strong>convert (or cast) the columns to a numeric dtype</strong>. </p>
<p><img src="https://s3.amazonaws.com/dq-content/293/cleaning_workflow.svg" alt="string to numeric cleaning workflow"></p>
<p>To do this, we use the <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.astype.html" target="_blank"><code>Series.astype()</code> method</a>. To convert the column to a numeric dtype, we can use either <code>int</code> or <code>float</code> as the parameter for the method. Since the <code>int</code> dtype can't store decimal values, we'll convert the <code>screen_size</code> column to the <code>float</code> dtype:</p>
</div>

```
laptops["screen_size"] = laptops["screen_size"].astype(float)
print(laptops["screen_size"].dtype)
print(laptops["screen_size"].unique())
```
```
float64

[13.3, 15.6, 15.4, 14. , 12. , 11.6, 17.3, 10.1, 13.5, 12.5,
 13. , 18.4, 13.9, 12.3, 17. , 15. , 14.1, 11.3]
```

<div>
<p>Our <code>screen_size</code> column is now the <code>float64</code> dtype. Let's convert the dtype of the <code>ram</code> column to numeric next.</p></div>
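
<div>
<p><code>Series.astype()</code> raises an error if any value can't be converted, which is usually what we want once the pattern has been checked. As a hedged aside, <code>pandas.to_numeric()</code> with <code>errors="coerce"</code> is an alternative that turns unconvertible values into <code>NaN</code> instead. The series below are made up for illustration.</p>
</div>

```python
import pandas as pd

# Clean values: astype() converts them directly.
ram = pd.Series(["8", "16", "4"])
ram_int = ram.astype(int)
print(ram_int.dtype)

# Hypothetical messy values: astype(int) would raise a ValueError here,
# while to_numeric(errors="coerce") replaces the bad entry with NaN.
messy = pd.Series(["8", "16", "unknown"])
coerced = pd.to_numeric(messy, errors="coerce")
print(coerced)
```

<div>
<p>Because <code>NaN</code> is a float value, the coerced series ends up with a float dtype even though the valid entries are whole numbers.</p>
</div>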

### Instructions 

<ol>
<li>Use the <code>Series.astype()</code> method to change the <code>ram</code> column to an <code>integer</code> dtype.</li>
<li>Use the <code>DataFrame.dtypes</code> attribute to get a list of the column names and types from the <code>laptops</code> dataframe. Assign the result to <code>dtypes</code>.</li>
<li>After running your code, use the variable inspector to view the <code>dtypes</code> variable to see the results of your code.</li>
</ol>

In [10]:
laptops["screen_size"] = laptops["screen_size"].astype(float)
print(laptops["screen_size"].dtype)

laptops["ram"] = laptops["ram"].str.replace('GB','')
laptops["ram"] = laptops["ram"].astype(int)
dtypes = laptops.dtypes

float64


# Renaming columns 

<div><p>Now that we've converted our columns to numeric dtypes, the final step is to <strong>rename the column</strong>. This is an optional step, and can be useful if the non-digit values contain information that helps us understand the data. </p>
<p><img src="https://s3.amazonaws.com/dq-content/293/cleaning_workflow.svg" alt="string to numeric cleaning workflow"></p>
<p>In our case, the quote characters we removed from the <code>screen_size</code> column denoted that the screen size was in inches. As a reminder, here's what the original values looked like:</p>
</div>

```
['13.3"', '15.6"', '15.4"', '14.0"', '12.0"', '11.6"',
 '17.3"', '10.1"', '13.5"', '12.5"', '13.0"', '18.4"',
 '13.9"', '12.3"', '17.0"', '15.0"', '14.1"',
 '11.3"']
```

<div>
<p>To stop us from losing information that helps us understand the data, we can use the <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html" target="_blank"><code>DataFrame.rename()</code> method</a> to rename the column from <code>screen_size</code> to <code>screen_size_inches</code>. </p>
<p>Below, we specify the <code>axis=1</code> parameter so pandas knows that we want to rename labels in the column axis:</p>
</div>

```
laptops.rename({"screen_size": "screen_size_inches"}, axis=1, inplace=True)
print(laptops.dtypes)
```
```
manufacturer           object
model_name             object
category               object
screen_size_inches    float64
screen                 object
cpu                    object
ram                    object
storage                object
gpu                    object
os                     object
os_version             object
weight                 object
price_euros            object
dtype: object
```

<div>
<p>Note that we can either use <code>inplace=True</code> or assign the result back to the dataframe - both will give us the same results.</p>
<p>Let's rename the <code>ram</code> column next and analyze the results.</p></div>
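
<div>
<p>As a minimal sketch of the "assign the result back" alternative, the example below renames two columns on a small, made-up dataframe. It also uses the <code>columns=</code> keyword, which is equivalent to passing the dictionary with <code>axis=1</code>.</p>
</div>

```python
import pandas as pd

# Made-up data; only the column names mirror the lesson.
demo = pd.DataFrame({"screen_size": [13.3, 15.6], "ram": [8, 16]})

# rename() returns a new dataframe by default, so we assign the result.
renamed = demo.rename(columns={"screen_size": "screen_size_inches",
                               "ram": "ram_gb"})

print(list(renamed.columns))
print(list(demo.columns))  # the original dataframe is left unchanged
```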

### Instructions 

<ol>
<li>Because the <code>GB</code> characters contained useful information about the units (gigabytes) of the laptop's RAM, use the <code>DataFrame.rename()</code> method to rename the column from <code>ram</code> to <code>ram_gb</code>.</li>
<li>Use the <code>Series.describe()</code> method to return a series of descriptive statistics for the <code>ram_gb</code> column. Assign the result to <code>ram_gb_desc</code>.</li>
<li>After you have run your code, use the variable inspector to see the results of your code.</li>
</ol>

In [12]:
laptops.rename({"screen_size": "screen_size_inches"}, axis=1, inplace=True)

laptops.rename({"ram": "ram_gb"}, axis=1, inplace=True)
ram_gb_desc = laptops["ram_gb"].describe()

# Extracting values from strings

<div><p>Sometimes, it can be useful to extract non-numeric values from within strings. Let's look at the first five values from the <code>gpu</code> (graphics processing unit) column:</p>
</div>

```
print(laptops["gpu"].head())
```
```
0    Intel Iris Plus Graphics 640
1          Intel HD Graphics 6000
2           Intel HD Graphics 620
3              AMD Radeon Pro 455
4    Intel Iris Plus Graphics 650
Name: gpu, dtype: object
```

<div>
<p>The information in this column seems to be a manufacturer (Intel, AMD) followed by a model name/number. Let's extract the manufacturer by itself so we can find the most common ones.</p>
<p>Because each manufacturer is followed by a whitespace character, we can use the <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.split.html" target="_blank"><code>Series.str.split()</code> method</a> to extract this data:</p>
<p><img src="https://s3.amazonaws.com/dq-content/293/str_split_2.svg" alt="extracting data from a string, step 2"></p>
<p>This method splits each string on the whitespace; the result is a series containing individual Python lists. Also note that we used parentheses to method chain over multiple lines, which makes our code easier to read.</p>
<p>Just like with lists and ndarrays, we can use bracket notation to access the elements in each list in the series. With series, however, we use the <code>str</code> accessor followed by <code>[]</code> (brackets):</p>
</div>

```
print(laptops["gpu"].head().str.split().str[0])
```

<div>
<p>Above, we used <code>0</code> to select the <em>first</em> element in each list. Below is the result:</p>
</div>

```
0    Intel
1    Intel
2    Intel
3      AMD
4    Intel
Name: gpu, dtype: object
```
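
<div>
<p>To see the intermediate step, the sketch below splits two of the values shown above and prints the lists before selecting the first element. The last line is an optional variation using the <code>n</code> and <code>expand</code> parameters of <code>Series.str.split()</code> to split only once and return the parts as separate columns.</p>
</div>

```python
import pandas as pd

gpu = pd.Series(["Intel Iris Plus Graphics 640", "AMD Radeon Pro 455"])

# Step 1: split each string on whitespace, giving a series of Python lists.
parts = gpu.str.split()
print(parts)

# Step 2: select the first element of each list with the str accessor.
manufacturer = (gpu
                .str.split()
                .str[0]
                )
print(manufacturer.tolist())

# Variation: split at most once and expand the two parts into columns 0 and 1.
split_once = gpu.str.split(n=1, expand=True)
print(split_once)
```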

<div>
<p>Let's use this technique to extract the manufacturer from the <code>cpu</code> column as well. Here are the first 5 rows of the <code>cpu</code> column:</p>
</div>

```
print(laptops["cpu"].head())
```
```
0          Intel Core i5 2.3GHz
1          Intel Core i5 1.8GHz
2    Intel Core i5 7200U 2.5GHz
3          Intel Core i7 2.7GHz
4          Intel Core i5 3.1GHz
Name: cpu, dtype: object
```

### Instructions 

<p>In the example code, we have extracted the manufacturer name from the <code>gpu</code> column, and assigned it to a new column <code>gpu_manufacturer</code>.</p>

<ol>
<li>Extract the manufacturer name from the <code>cpu</code> column. Assign it to a new column <code>cpu_manufacturer</code>.</li>
<li>Use the <code>Series.value_counts()</code> method to find the counts of each manufacturer in <code>cpu_manufacturer</code>. Assign the result to <code>cpu_manufacturer_counts</code>.</li>
</ol>



In [13]:
laptops["gpu_manufacturer"] = (laptops["gpu"]
                                       .str.split()
                                       .str[0]
                              )
laptops["cpu_manufacturer"] = (laptops["cpu"]
                                       .str.split()
                                       .str[0])
cpu_manufacturer_counts = laptops["cpu_manufacturer"].value_counts()


# Correcting bad values 

<div><p>If your data has been scraped from a webpage or if there was manual data entry involved at some point, you may end up with inconsistent values. Let's look at an example from our <code>os</code> column:</p>
</div>

```
print(laptops["os"].value_counts())
```
```
Windows      1125
No OS          66
Linux          62
Chrome OS      27
macOS          13
Mac OS          8
Android         2
Name: os, dtype: int64
```

<div>
<p>We can see that there are two variations of the Apple operating system, macOS, in our dataset: <code>Mac OS</code> and <code>macOS</code>. One way we can fix this is with the <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html" target="_blank"><code>Series.map()</code> method</a>. The <code>Series.map()</code> method is best suited to changing many values in a column at once; we only have one value to fix here, but we'll use it as an opportunity to learn how the method works.</p>
<p>The most common way to use <code>Series.map()</code> is with a dictionary. Let's look at an example using a series of misspelled fruit:</p>
</div>

```
s = pd.Series(["pair", "oranje", "bananna", "oranje", "oranje", "oranje"])
print(s)
```
```
0       pair
1     oranje
2    bananna
3     oranje
4     oranje
5     oranje
dtype: object
```

<div>
<p>We'll create a dictionary called <code>corrections</code> and pass that dictionary as an argument to <code>Series.map()</code>:</p>
</div>

```
corrections = {
    "pair": "pear",
    "oranje": "orange",
    "bananna": "banana"
}
s = s.map(corrections)
print(s)
```
```
0       pear
1     orange
2     banana
3     orange
4     orange
5     orange
dtype: object
```

<div>
<p>We can see that each of our corrections was made across our series. One important thing to remember with <code>Series.map()</code> is that if a value from your series doesn't exist as a key in your dictionary, it will convert that value to <code>NaN</code>. Let's see what happens when we run map one more time:</p>
</div>

```
s = s.map(corrections)
print(s)
```
```
0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
5    NaN
dtype: object
```

<div>
<p>Because none of the corrected values in our series existed as keys in our dictionary, all values became <code>NaN</code>! It's a very common occurrence, especially when working in a Jupyter notebook, where you can easily re-run cells.</p>
<p>Let's use <code>Series.map()</code> to clean the values in the <code>os</code> column.</p></div>
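
<div>
<p>If listing every valid value as a key feels error-prone, there are ways to correct only the values we name and leave the rest untouched. The sketch below, on a made-up series, compares plain <code>Series.map()</code> with two alternatives: <code>Series.replace()</code>, and <code>Series.map()</code> followed by <code>Series.fillna()</code>. These are offered as options, not as the lesson's own method.</p>
</div>

```python
import pandas as pd

os_col = pd.Series(["Windows", "Mac OS", "macOS", "Linux"])
corrections = {"Mac OS": "macOS"}  # only the value we want to fix

# map() alone: every value missing from the dictionary becomes NaN.
mapped = os_col.map(corrections)
print(mapped.tolist())

# replace(): values not found in the dictionary are left as they are.
replaced = os_col.replace(corrections)
print(replaced.tolist())

# map() then fillna(): fall back to the original value where map() gave NaN.
mapped_filled = os_col.map(corrections).fillna(os_col)
print(mapped_filled.tolist())
```

<div>
<p>Both alternatives are also safe to re-run, since already-corrected values simply pass through unchanged.</p>
</div>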


### Instructions 


<p>We have created a dictionary for you to use with mapping.  Note that we have included both the correct and incorrect spelling of macOS as keys, otherwise we'll end up with null values.</p>

<ol>
<li>Use the <code>Series.map()</code> method with the <code>mapping_dict</code> dictionary to correct the values in the <code>os</code> column.</li>
</ol>

In [15]:
mapping_dict = {
    'Android': 'Android',
    'Chrome OS': 'Chrome OS',
    'Linux': 'Linux',
    'Mac OS': 'macOS',
    'No OS': 'No OS',
    'Windows': 'Windows',
    'macOS': 'macOS'
}

laptops["os"] = laptops["os"].map(mapping_dict)

# Dropping missing values 

<div><p>In previous missions, we've talked briefly about missing values and how both NumPy and pandas represent these as null values. In pandas, null values will be indicated by either <code>NaN</code> or <code>None</code>.</p>
<p>Recall that we can use the <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isnull.html" target="_blank"><code>DataFrame.isnull()</code> method</a> to identify missing values, which returns a boolean dataframe. We can then use the <code>DataFrame.sum()</code> method to give us a count of the <code>True</code> values for each column:</p>
</div>

```
print(laptops.isnull().sum())
```
```
manufacturer            0
model_name              0
category                0
screen_size_inches      0
screen                  0
cpu                     0
ram_gb                  0
storage                 0
gpu                     0
os                      0
os_version            170
weight_kg               0
price_euros             0
cpu_manufacturer        0
screen_resolution       0
cpu_speed               0
dtype: int64
```

<div>
<p>It's now clear that we have only one column with null values, <code>os_version</code>, which has 170 missing values.</p>
<p>There are a few options for handling missing values:</p>
<ul>
<li>Remove any rows that have missing values.</li>
<li>Remove any columns that have missing values.</li>
<li>Fill the missing values with some other value.</li>
<li>Leave the missing values as is.</li>
</ul>
<p>The first two options are often used to prepare data for machine learning algorithms, many of which can't handle data that includes null values. We can use the <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html" target="_blank"><code>DataFrame.dropna()</code> method</a> to remove or <strong>drop</strong> rows and columns with null values.</p>
<p>The <code>DataFrame.dropna()</code> method accepts an <code>axis</code> parameter, which indicates whether we want to drop along the column or index axis. Let's look at an example:</p>
<p><img src="https://s3.amazonaws.com/dq-content/293/dropna_1.svg" alt="removing missing values example dataframe"></p>
<p>The default value for the <code>axis</code> parameter is <code>0</code>, so <code>df.dropna()</code> returns an identical result to <code>df.dropna(axis=0)</code>:</p>
<p><img src="https://s3.amazonaws.com/dq-content/293/dropna_2.svg" alt="removing missing values axis=0"></p>
<p>The rows with labels <code>x</code> and <code>z</code> contain null values, so those rows are dropped. Let's look at what happens when we use <code>axis=1</code> to specify the column axis:</p>
<p><img src="https://s3.amazonaws.com/dq-content/293/dropna_3.svg" alt="removing missing values axis=1"></p>
<p>Only the column with label <code>C</code> contains null values, so, in this case, just one column is removed.</p>
<p>Let's practice using <code>DataFrame.dropna()</code> to remove rows and columns:</p></div>
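<div><p>Before the exercise, here is a minimal sketch of both calls on a small hypothetical dataframe. It is not the dataframe from the figures: the labels <code>x</code>, <code>y</code>, <code>z</code> and <code>A</code>, <code>B</code>, <code>C</code> and all of the values are assumptions chosen so that only column <code>C</code> has nulls, in rows <code>x</code> and <code>z</code>.</p></div>

```python
import numpy as np
import pandas as pd

# Hypothetical data: only column C has nulls, in rows x and z.
df = pd.DataFrame(
    {"A": [1, 2, 3], "B": [4, 5, 6], "C": [np.nan, 7.0, np.nan]},
    index=["x", "y", "z"],
)

# axis=0 is the default, so this drops every row containing a null.
rows_dropped = df.dropna()

# axis=1 drops every column containing a null instead.
cols_dropped = df.dropna(axis=1)

print(rows_dropped)
print(cols_dropped)
```

<div><p>Neither call modifies <code>df</code>; each returns a new dataframe, which is why the exercise assigns the results to new variables.</p></div>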

### Instructions 

<ol>
<li>Use <code>DataFrame.dropna()</code> to remove any rows from the laptops dataframe that have null values. Assign the result to <code>laptops_no_null_rows</code>.</li>
<li>Use <code>DataFrame.dropna()</code> to remove any columns from the laptops dataframe that have null values. Assign the result to <code>laptops_no_null_cols</code>.</li>
</ol>

```
laptops_no_null_rows = laptops.dropna()

laptops_no_null_cols = laptops.dropna(axis=1)
```

# Filling missing values 

<div><p>In the previous screen, we learned there are various ways to deal with missing values:</p>
<ul>
<li>Remove any rows that have missing values.</li>
<li>Remove any columns that have missing values.</li>
<li>Fill the missing values with some other value.</li>
<li>Leave the missing values as is.</li>
</ul>
<p>While dropping rows or columns is the easiest way to deal with missing values, it may not always be the <em>best</em> approach. For example, removing a disproportionate number of one manufacturer's laptops could change our analysis.</p>
<p>Because of this, it's a good idea to explore the missing values in the <code>os_version</code> column before making a decision. We can use <code>Series.value_counts()</code> to explore all of the values in the column, but we'll use a parameter we haven't seen before:</p>
</div>

```
print(laptops["os_version"].value_counts(dropna=False))
```
```
10      1072
NaN      170
7         45
X          8
10 S       8
Name: os_version, dtype: int64
```

<div>
<p>Because we set the <code>dropna</code> parameter to <code>False</code>, the result includes null values. We can see that the majority of values in the column are <code>10</code> and missing values are the next most common.</p>
<p>Let's also explore the <code>os</code> column, since it's closely related to the <code>os_version</code> column. We'll only look at rows in which the <code>os_version</code> is missing:</p>
</div>

```
os_with_null_v = laptops.loc[laptops["os_version"].isnull(),"os"]
print(os_with_null_v.value_counts())
```
```
No OS        66
Linux        62
Chrome OS    27
macOS        13
Android       2
Name: os, dtype: int64
```

<div>
<p>Immediately, we can observe a few things:</p>
<ul>
<li>The most frequent value is "No OS". This is important to note because if there is no OS, there <em>shouldn't</em> be a version defined in the <code>os_version</code> column.</li>
<li>Thirteen of the laptops that come with macOS do not specify the version. We can use our knowledge of <a href="https://en.wikipedia.org/wiki/MacOS" target="_blank">macOS</a> to confirm that <code>os_version</code> should be equal to <code>X</code>.</li>
</ul>
<p>In both of these cases, we can fill the missing values to make our data more correct. For the rest of the values, it's probably best to leave them as missing rather than drop those rows and lose important data.</p>
<p>We can use assignment with a boolean comparison to perform this replacement, like below:</p>
</div>

```
laptops.loc[laptops["os"] == "macOS", "os_version"] = "X"
```
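<div><p>Note that the assignment above sets <code>os_version</code> for every macOS row, including rows that already have a version. As a minimal sketch of a narrower variant, the comparison can be combined with <code>Series.isnull()</code> so that only missing versions are filled. The dataframe <code>toy</code> and its rows are hypothetical and are not taken from <code>laptops.csv</code>.</p></div>

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    "os": ["macOS", "macOS", "Linux", "No OS"],
    "os_version": ["X", np.nan, np.nan, np.nan],
})

# True only for macOS rows whose version is missing.
mac_missing = (toy["os"] == "macOS") & toy["os_version"].isnull()

# Fill just those rows; all other rows are left untouched.
toy.loc[mac_missing, "os_version"] = "X"

print(toy)
```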

<div>
<p>For rows with <code>No OS</code> values, let's replace the missing value in the <code>os_version</code> column with the value <code>Version Unknown</code>.</p></div>

### Instructions 

<ol>
<li>Use a boolean array to identify rows that have the value <code>No OS</code> for the <code>os</code> column. Then, use assignment to assign the value <code>Version Unknown</code> to the <code>os_version</code> column for those rows.</li>
<li>
<p>Use the syntax below to create the <code>value_counts_after</code> variable:</p>
<p><code>value_counts_after = laptops.loc[laptops["os_version"].isnull(), "os"].value_counts()</code></p>
</li>
<li>
<p>After running your code, use the variable inspector to look at the difference between <code>value_counts_before</code> and <code>value_counts_after</code>.</p>
</li>
</ol>

```
value_counts_before = laptops.loc[laptops["os_version"].isnull(), "os"].value_counts()
laptops.loc[laptops["os"] == "macOS", "os_version"] = "X"

laptops.loc[laptops["os"] == "No OS", "os_version"] = "Version Unknown"
value_counts_after = laptops.loc[laptops["os_version"].isnull(), "os"].value_counts()
```

# Challenge: Clean a string column 

<div><p>Now it's time to practice what we've learned so far! In this challenge, we'll clean the <code>weight</code> column. Let's look at a sample of the data in that column:</p>
</div>

```
print(laptops["weight"].head())
```
```
0    1.37kg
1    1.34kg
2    1.86kg
3    1.83kg
4    1.37kg
Name: weight, dtype: object
```

<div>
<p>Your challenge is to convert the values in this column to numeric values. As a reminder, here's the data cleaning workflow you can use:</p>
<p><img src="https://s3.amazonaws.com/dq-content/293/cleaning_workflow.svg" alt="string to numeric cleaning workflow"></p>
<p>While it appears that the <code>weight</code> column may just need the <code>kg</code> characters removed from the end of each string, there is one special case: one of the values ends with <code>kgs</code>, so you'll have to remove both the <code>kg</code> and <code>kgs</code> suffixes.</p>
<p>In the last step of this challenge, we'll also ask you to use the <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html" target="_blank"><code>DataFrame.to_csv()</code> method</a> to save the cleaned data to a CSV file. It's a good idea to save a CSV when you finish cleaning in case you wish to do analysis later.</p>
<p>We can use the following syntax to save a CSV:</p>
</div>

```
df.to_csv('filename.csv', index=False)
```

<div>
<p>By default, pandas will save the index labels as a column in the CSV file. Our dataset has integer labels that don't contain any data, so we don't need to save the index.</p>
<p>Don't be discouraged if this challenge takes a few attempts to get right. Iterating on your code is a normal part of data cleaning, and this challenge is more difficult than the exercises you have completed so far. We have included some extra hints, but we encourage you to try without them first and use them only if you need them!</p></div>
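<div><p>To see what <code>index=False</code> changes, here is a small sketch on a hypothetical two-row dataframe (the name <code>toy</code> and its values are assumptions). It relies on <code>DataFrame.to_csv()</code> returning the CSV text as a string when no file name is given, so nothing is written to disk.</p></div>

```python
import pandas as pd

# Hypothetical data for illustration only.
toy = pd.DataFrame({"model": ["a", "b"], "weight_kg": [1.37, 1.86]})

# With no path, to_csv() returns the CSV text instead of writing a file.
with_index = toy.to_csv()                # index labels become an unnamed first column
without_index = toy.to_csv(index=False)  # index labels are left out

print(with_index)
print(without_index)
```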

### Instructions 

<ol>
<li>Convert the values in the <code>weight</code> column to numeric values.</li>
<li>Rename the <code>weight</code> column to <code>weight_kg</code>.</li>
<li>Use the <code>DataFrame.to_csv()</code> method to save the laptops dataframe to a CSV file <code>laptops_cleaned.csv</code> <em>without</em> index labels.</li>
</ol>

```
# Explore the values before cleaning
print(laptops["weight"].value_counts(dropna=False))

# Remove the longer "kgs" suffix first, then the common "kg" suffix
laptops["weight"] = laptops["weight"].str.replace("kgs", "")
laptops["weight"] = laptops["weight"].str.replace("kg", "")

# Convert the cleaned strings to floats
laptops["weight"] = laptops["weight"].astype(float)

# Rename the column to record the unit
laptops.rename({"weight" : "weight_kg"}, axis=1, inplace=True)

# Save the cleaned data without index labels
laptops.to_csv("laptops_cleaned.csv", index=False)
```
```
2.2kg     121
2.1kg      58
2.4kg      44
2.3kg      41
2.5kg      38
         ... 
0.99kg      1
2.21kg      1
1.55kg      1
1.79kg      1
4.4kg       1
Name: weight, Length: 179, dtype: int64
```
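<div><p>The solution above removes the two suffixes with two chained <code>Series.str.replace()</code> calls. As one possible alternative, a single regular expression can strip a trailing <code>kg</code> or <code>kgs</code> in one pass. The sketch below uses a hypothetical three-value series rather than the real column, and the variable names are assumptions.</p></div>

```python
import pandas as pd

# Hypothetical sample that includes the "kgs" special case.
weights = pd.Series(["1.37kg", "2.2kg", "4kgs"], name="weight")

# "kgs?$" matches "kg" or "kgs" only at the end of each string.
cleaned = weights.str.replace(r"kgs?$", "", regex=True).astype(float)

# Rename the series to record the unit.
cleaned = cleaned.rename("weight_kg")

print(cleaned)
```

<div><p>Anchoring the pattern with <code>$</code> keeps the replacement from touching those characters anywhere else in a value.</p></div>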
