The pandas.read_csv() function has an encoding argument we can use to specify an encoding:

df = pd.read_csv("filename.csv", encoding="some_encoding")

Import the pandas library.
Use the pandas.read_csv() function to read the laptops.csv file into a dataframe laptops.
Specify the encoding using the string "Latin-1".
Use the DataFrame.info() method to display information about the laptops dataframe.

In [1]:
import numpy as np
import pandas as pd

laptops = pd.read_csv('laptops.csv', encoding = 'latin-1')

laptops.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Manufacturer              1303 non-null   object
 1   Model Name                1303 non-null   object
 2   Category                  1303 non-null   object
 3   Screen Size               1303 non-null   object
 4   Screen                    1303 non-null   object
 5   CPU                       1303 non-null   object
 6   RAM                       1303 non-null   object
 7    Storage                  1303 non-null   object
 8   GPU                       1303 non-null   object
 9   Operating System          1303 non-null   object
 10  Operating System Version  1133 non-null   object
 11  Weight                    1303 non-null   object
 12  Price (Euros)             1303 non-null   object
dtypes: object(13)
memory usage: 132.5+ KB
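
When the encoding of a file is not known in advance, one option is to try a few candidates in order. The sketch below is only an illustration: the sample file, the helper name read_csv_with_fallback, and the list of candidate encodings are assumptions made for the example, not part of the laptops exercise.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample file: a tiny CSV written as Latin-1, so "é" is stored
# as a single byte that is not valid UTF-8.
path = os.path.join(tempfile.mkdtemp(), "sample_latin1.csv")
with open(path, "w", encoding="latin-1", newline="") as f:
    f.write("Manufacturer,Model Name\nAcer,Aspire é\n")


def read_csv_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Try each encoding in order and return (dataframe, encoding_used)."""
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError:
            # This encoding cannot decode the file, so try the next one.
            continue
    raise ValueError(f"none of {encodings} could decode {path}")


sample, used_encoding = read_csv_with_fallback(path)
print(used_encoding)
print(sample)
```

Latin-1 can decode any byte sequence, so it works as a last resort, but a successful read does not prove the text was decoded correctly. It is still worth checking a few values by eye.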


We can access the column axis of a dataframe using the DataFrame.columns attribute. This returns an index object — a special type of NumPy ndarray — with the labels of each column:

In [2]:
print(laptops.columns)

Index(['Manufacturer', 'Model Name', 'Category', 'Screen Size', 'Screen',
       'CPU', 'RAM', ' Storage', 'GPU', 'Operating System',
       'Operating System Version', 'Weight', 'Price (Euros)'],
      dtype='object')


Remove any whitespace from the start and end of each column name.
Create an empty list named new_columns.
Use a for loop to iterate through each column name using the DataFrame.columns attribute. Inside the body of the for loop:
Use the str.strip() method to remove whitespace from the start and end of the string.
Append the updated column name to the new_columns list.
Assign the updated column names to the DataFrame.columns attribute.

In [3]:
new_columns = []
for c in laptops.columns:
    clean_c = c.strip()
    new_columns.append(clean_c)
    
laptops.columns = new_columns
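
As a possible alternative to the loop, the column Index has its own str accessor, so the labels can be stripped in one step. This is a sketch on a made-up toy dataframe (toy and its values are assumptions for illustration), not the exercise's expected solution.

```python
import pandas as pd

# Toy dataframe with the same kind of stray whitespace as " Storage".
toy = pd.DataFrame({"RAM": ["8GB"], " Storage": ["128GB SSD"], "GPU ": ["Intel"]})

# Index.str.strip() strips every label at once and returns a new Index.
toy.columns = toy.columns.str.strip()
print(list(toy.columns))
```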

We can create a function that uses Python string methods to clean our column labels, and then again use a loop to apply that function to each label. Let's look at an example:

In [4]:
def clean_col(col):
    col = col.replace(" ","_")
    col = col.replace("(","")
    col = col.replace(")","")
    col = col.lower()
    return col

new_columns = []
for c in laptops.columns:
    clean_c = clean_col(c)
    new_columns.append(clean_c)

laptops.columns = new_columns
print(laptops.columns)

Index(['manufacturer', 'model_name', 'category', 'screen_size', 'screen',
       'cpu', 'ram', 'storage', 'gpu', 'operating_system',
       'operating_system_version', 'weight', 'price_euros'],
      dtype='object')


Define a function, which:
Uses the str.strip() method to remove whitespace from the start and end of the string.
Uses the str.replace() method to remove parentheses from the string.
Uses the str.lower() method to make the string lowercase.
Returns the modified string.
Use a loop to apply the function to each item in the index object and assign it back to the DataFrame.columns attribute.
Print the new values for the DataFrame.columns attribute.

In [5]:
import pandas as pd
laptops = pd.read_csv('laptops.csv', encoding='Latin-1')
def clean_col(col):
    col = col.strip()
    col = col.replace("Operating System", "os")
    col = col.replace(" ","_")
    col = col.replace("(","")
    col = col.replace(")","")
    col = col.lower()
    return col

new_columns = []
for c in laptops.columns:
    clean_c = clean_col(c)
    new_columns.append(clean_c)
    
laptops.columns = new_columns

laptops.columns

#R replace SINGLE column
# colnames(laptops)[10] <- 'os_system'
# colnames(laptops)[11] <- 'os_system_version'

#laptops.columns[0]

Index(['manufacturer', 'model_name', 'category', 'screen_size', 'screen',
       'cpu', 'ram', 'storage', 'gpu', 'os', 'os_version', 'weight',
       'price_euros'],
      dtype='object')
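
The loop above can also be avoided, because DataFrame.rename() accepts a function for its columns argument and calls it on every label. The sketch below reuses the same clean_col logic on a small, empty toy dataframe. The toy column list is an assumption chosen to exercise each replacement.

```python
import pandas as pd


def clean_col(col):
    """Normalize one column label (same steps as the cell above)."""
    col = col.strip()
    col = col.replace("Operating System", "os")
    col = col.replace(" ", "_")
    col = col.replace("(", "")
    col = col.replace(")", "")
    return col.lower()


# Empty toy dataframe that only carries column labels.
toy = pd.DataFrame(
    columns=["Manufacturer", " Storage", "Operating System Version", "Price (Euros)"]
)

# rename() applies clean_col to each label and returns a new dataframe.
toy = toy.rename(columns=clean_col)
print(list(toy.columns))
```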

Use the Series.unique() method to identify the unique values in the ram column of the laptops dataframe. Assign the result to unique_ram.
After running your code, use the variable inspector to view the unique values in the ram column and identify any patterns.


In [6]:
unique_ram = laptops['ram'].unique()

#R unique(laptops$ram)

The pandas library contains dozens of vectorized string methods we can use to manipulate text data, many of which perform the same operations as Python string methods. Most vectorized string methods are available using the Series.str accessor, which means we can access them by adding str between the series name and the method name, for example Series.str.lower().
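
To make the accessor pattern concrete, here is a small sketch comparing a plain Python string method with its vectorized counterpart. The series values are made up for illustration.

```python
import pandas as pd

# Hypothetical values, including one missing entry.
s = pd.Series(["8GB", "16GB", None])

# Plain Python string method: operates on a single string.
single = "8GB".lower()

# Vectorized method: same operation on every element of the series.
# Missing values are passed through as missing rather than raising an error.
lowered = s.str.lower()

print(single)
print(lowered)
```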
    
In this case, we can use the Series.str.replace() method, which is a vectorized version of the Python str.replace() method we used in the previous screen, to remove all the quote characters from every string in the screen_size column:
    
    

In [7]:
laptops['screen_size'] = laptops['screen_size'].str.replace('"','')
print(laptops['screen_size'].unique())

['13.3' '15.6' '15.4' '14.0' '12.0' '11.6' '17.3' '10.1' '13.5' '12.5'
 '13.0' '18.4' '13.9' '12.3' '17.0' '15.0' '14.1' '11.3']


Use the Series.str.replace() method to remove the substring GB from the ram column.
Use the Series.unique() method to assign the unique values in the ram column to unique_ram.
After running your code, use the variable inspector to verify your changes.

In [8]:
laptops['ram'] = laptops['ram'].str.replace('GB','')
unique_ram = laptops['ram'].unique()

print(unique_ram)


#R laptops <- laptops %>% 
#R  mutate(ram = str_replace(ram,'GB',''))

['8' '16' '4' '2' '12' '6' '32' '24' '64']


To convert a text column to a numeric dtype, we use the Series.astype() method, passing either int or float as its argument. Since the int dtype can't store decimal values, we'll convert the screen_size column to the float dtype:

In [9]:
laptops["screen_size"] = laptops["screen_size"].astype(float)
print(laptops["screen_size"].dtype)
print(laptops["screen_size"].unique())

float64
[13.3 15.6 15.4 14.  12.  11.6 17.3 10.1 13.5 12.5 13.  18.4 13.9 12.3
 17.  15.  14.1 11.3]
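
Series.astype() raises an error if any value cannot be converted. When a column may still contain stray text, pandas.to_numeric() with errors="coerce" is one alternative that turns unparseable values into NaN instead. The values below are hypothetical and only meant to show the difference.

```python
import pandas as pd

# Hypothetical values: two clean numbers and one leftover placeholder.
sizes = pd.Series(["13.3", "15.6", "n/a"])

# astype(float) fails as soon as it meets a value it cannot parse.
astype_failed = False
try:
    sizes.astype(float)
except ValueError as exc:
    astype_failed = True
    print("astype failed:", exc)

# to_numeric with errors="coerce" replaces unparseable values with NaN.
converted = pd.to_numeric(sizes, errors="coerce")
print(converted)
```

Coercing silently hides bad values, so it is worth counting the resulting NaN entries before relying on the column.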


In [10]:
laptops['ram'] = laptops['ram'].astype(int)

dtypes = laptops.dtypes

print(dtypes)

manufacturer     object
model_name       object
category         object
screen_size     float64
screen           object
cpu              object
ram               int32
storage          object
gpu              object
os               object
os_version       object
weight           object
price_euros      object
dtype: object


To stop us from losing information that helps us understand the data, we can use the DataFrame.rename() method to rename the column from screen_size to screen_size_inches.

Below, we specify the axis=1 parameter so pandas knows that we want to rename labels in the column axis:



In [11]:
laptops.rename({"screen_size": "screen_size_inches"}, axis=1, inplace=True)
print(laptops.dtypes)

manufacturer           object
model_name             object
category               object
screen_size_inches    float64
screen                 object
cpu                    object
ram                     int32
storage                object
gpu                    object
os                     object
os_version             object
weight                 object
price_euros            object
dtype: object
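
An equivalent way to write the rename, shown here as a sketch on a made-up toy dataframe, is to pass the mapping through the columns keyword and assign the returned dataframe instead of using axis=1 with inplace=True.

```python
import pandas as pd

# Toy dataframe; the values are assumptions for illustration.
toy = pd.DataFrame({"screen_size": [13.3, 15.6], "ram": [8, 16]})

# columns= targets the column axis, so axis=1 is not needed.
# Without inplace=True, rename() returns a new dataframe and leaves toy as-is.
renamed = toy.rename(columns={"screen_size": "screen_size_inches"})

print(list(toy.columns))
print(list(renamed.columns))
```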


Because the GB characters contained useful information about the units (gigabytes) of the laptop's ram, use the DataFrame.rename() method to rename the column from ram to ram_gb.
Use the Series.describe() method to return a series of descriptive statistics for the ram_gb column. Assign the result to ram_gb_desc.
After you have run your code, use the variable inspector to see the results of your code.

In [12]:
laptops.rename({'ram':'ram_gb'}, axis = 1, inplace = True)

ram_gb_desc = laptops['ram_gb'].describe()

print(ram_gb_desc)

count    1303.000000
mean        8.382195
std         5.084665
min         2.000000
25%         4.000000
50%         8.000000
75%         8.000000
max        64.000000
Name: ram_gb, dtype: float64


Just like with lists and ndarrays, we can use bracket notation to access the elements in each list in the series. **With series, however, we use the str accessor followed by [] (brackets):**

In [13]:
print(laptops["gpu"].head().str.split().str[0])

#R str_split by default is a list -> simplify = TRUE to convert to matrix

0    Intel
1    Intel
2    Intel
3      AMD
4    Intel
Name: gpu, dtype: object


Above, we used 0 to select the first element in each list; the output shows the result.
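
The sketch below walks through the same idea step by step on three illustrative GPU strings (the values are assumptions, not taken from the dataset). It also shows negative indexing and the n and expand parameters of Series.str.split().

```python
import pandas as pd

# Illustrative GPU descriptions.
gpus = pd.Series(
    ["Intel Iris Plus Graphics 640", "AMD Radeon Pro 455", "Nvidia GeForce MX150"]
)

# Splitting on whitespace gives a series whose elements are lists.
parts = gpus.str.split()

# .str[i] indexes into each list; negative indexes work as with normal lists.
first = parts.str[0]
last = parts.str[-1]

# n=1 splits only once; expand=True returns a dataframe with one column per part.
two_cols = gpus.str.split(n=1, expand=True)

print(first.tolist())
print(last.tolist())
print(two_cols)
```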

In the example code, we have extracted the manufacturer name from the gpu column, and assigned it to a new column gpu_manufacturer.

Extract the manufacturer name from the cpu column. Assign it to a new column cpu_manufacturer.
Use the Series.value_counts() method to find the counts of each manufacturer in cpu_manufacturer. Assign the result to cpu_manufacturer_counts.

In [14]:
laptops["gpu_manufacturer"] = (laptops["gpu"]
                                       .str.split()
                                       .str[0]
                              )


laptops['cpu_manufacturer'] = laptops['cpu'].str.split().str[0]

cpu_manufacturer_counts = laptops['cpu_manufacturer'].value_counts()

print(cpu_manufacturer_counts)

Intel      1240
AMD          62
Samsung       1
Name: cpu_manufacturer, dtype: int64


In [15]:
laptops['os'].value_counts()

Windows      1125
No OS          66
Linux          62
Chrome OS      27
macOS          13
Mac OS          8
Android         2
Name: os, dtype: int64

We can see that each of our corrections was made across our series. One important thing to remember with Series.map() is that if a value from your series doesn't exist as a key in your dictionary, it will convert that value to NaN. Consider what happens when we run map one more time.

Because none of the corrected values in our series existed as keys in our dictionary, all values became NaN! This is a very common occurrence, especially when working in a Jupyter notebook, where you can easily re-run cells.
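
This behaviour can be reproduced on a small example. The series and the corrections dictionary below are made up for illustration. Series.replace() is shown as one alternative that leaves unmatched values untouched, so running it twice is harmless.

```python
import pandas as pd

# Hypothetical series with two kinds of misspelling.
os_col = pd.Series(["Mac OS", "Windws", "Mac OS"])
corrections = {"Mac OS": "macOS", "Windws": "Windows"}

# First pass: every value is a key in the dictionary, so all are corrected.
once = os_col.map(corrections)

# Second pass: the corrected values are not keys, so every value becomes NaN.
twice = once.map(corrections)

# replace() only touches values that match a key, so it is safe to re-run.
replaced_twice = os_col.replace(corrections).replace(corrections)

print(once.tolist())
print(twice.tolist())
print(replaced_twice.tolist())
```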
    
We have created a dictionary for you to use with mapping. Note that we have included both the correct and incorrect spelling of macOS as keys, otherwise we'll end up with null values.

Use the Series.map() method with the mapping_dict dictionary to correct the values in the os column.

In [16]:
#Be sure to store the column back to the os column.


mapping_dict = {
    'Android': 'Android',
    'Chrome OS': 'Chrome OS',
    'Linux': 'Linux',
    'Mac OS': 'macOS',
    'No OS': 'No OS',
    'Windows': 'Windows',
    'macOS': 'macOS'
}

laptops['os'] = laptops['os'].map(mapping_dict)

laptops['os'].value_counts()

Windows      1125
No OS          66
Linux          62
Chrome OS      27
macOS          21
Android         2
Name: os, dtype: int64

Use DataFrame.dropna() to remove any rows from the laptops dataframe that have null values. Assign the result to laptops_no_null_rows.
Use DataFrame.dropna() to remove any columns from the laptops dataframe that have null values. Assign the result to laptops_no_null_cols.

In [17]:
laptops_no_null_rows = laptops.dropna(axis = 0)
laptops_no_null_cols = laptops.dropna(axis = 1)

#R laptops_1 <- laptops[complete.cases(laptops),]
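
To see how much data each option discards, the sketch below applies both calls to a three-row toy dataframe (the values are assumptions for illustration) and compares the resulting shapes.

```python
import numpy as np
import pandas as pd

# Toy dataframe in which only os_version has missing values.
toy = pd.DataFrame({
    "os": ["Windows", "No OS", "macOS"],
    "os_version": ["10", np.nan, np.nan],
    "ram_gb": [8, 4, 16],
})

# axis=0 drops every row that contains at least one null value.
no_null_rows = toy.dropna(axis=0)

# axis=1 drops every column that contains at least one null value.
no_null_cols = toy.dropna(axis=1)

print(toy.shape, no_null_rows.shape, no_null_cols.shape)
```

Here two of the three rows are lost when dropping rows, and the whole os_version column is lost when dropping columns.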



Because dropping rows or columns discards data, it's a good idea to explore the missing values in the os_version column before making a decision. We can use Series.value_counts() to explore all of the values in the column, but we'll use the dropna parameter, which we haven't seen before:

In [18]:
print(laptops['os_version'].value_counts(dropna = False))

10      1072
NaN      170
7         45
X          8
10 S       8
Name: os_version, dtype: int64


Because we set the dropna parameter to False, the result includes null values. We can see that the majority of values in the column are 10 and missing values are the next most common.

Let's also explore the os column, since it is closely related to the os_version column. We'll only look at rows in which the os_version is missing:

In [19]:
os_with_null_v = laptops.loc[laptops["os_version"].isnull(),"os"]
print(os_with_null_v.value_counts())

No OS        66
Linux        62
Chrome OS    27
macOS        13
Android       2
Name: os, dtype: int64


The most frequent value is "No OS". This is important to note because if a laptop has no OS, there shouldn't be a version defined in the os_version column.
Thirteen of the laptops that come with macOS do not specify the version. We can use our knowledge of macOS to confirm that os_version should be equal to X.

We can use assignment with a boolean comparison to perform this replacement, like below:


In [20]:
value_counts_before = laptops.loc[laptops["os_version"].isnull(), "os"].value_counts()

laptops.loc[laptops['os'] == 'macOS', 'os_version'] = 'X'  # assign through .loc so the original dataframe is modified

Use a boolean array to identify rows that have the value No OS for the os column. Then, use assignment to assign the value Version Unknown to the os_version column for those rows.
Use the syntax below to create value_counts_after variable:

value_counts_after = laptops.loc[laptops["os_version"].isnull(), "os"].value_counts()

After running your code, use the variable inspector to look at the difference between value_counts_before and value_counts_after.

In [21]:
laptops.loc[laptops['os'] == 'No OS', 'os_version'] = 'Version Unknown'  # assign through .loc so the original dataframe is modified

value_counts_after = laptops.loc[laptops['os_version'].isnull(),'os'].value_counts()
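
The same pattern can be traced on a toy dataframe, where the effect of each step is easy to check by eye. The rows below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy dataframe: three rows are missing os_version.
toy = pd.DataFrame({
    "os": ["Windows", "No OS", "macOS", "No OS"],
    "os_version": ["10", np.nan, np.nan, np.nan],
})

# Boolean array that is True for the rows we want to change.
no_os = toy["os"] == "No OS"

# .loc[rows, column] = value assigns only to the selected cells.
toy.loc[no_os, "os_version"] = "Version Unknown"

# Count which os values still have a missing version.
remaining_nulls = toy.loc[toy["os_version"].isnull(), "os"].value_counts()

print(toy)
print(remaining_nulls)
```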

In [22]:

# Chain the string methods: remove "kgs" first, then "kg", then convert to float.
laptops['weight'] = (laptops["weight"]
                     .str.replace("kgs", "")
                     .str.replace("kg", "")
                     .astype(float))

# The dictionary already names the column to rename, so no separate column argument is needed.
laptops.rename({"weight": "weight_kg"}, axis=1, inplace=True)

laptops.to_csv('laptops_cleaned.csv',index=False)
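
As a possible variation on the weight cleanup, a single regular expression can remove both unit spellings in one pass. This sketch uses made-up weight strings and assumes the unit always appears at the end of the value.

```python
import pandas as pd

# Hypothetical weight strings with both unit spellings.
weights = pd.Series(["1.37kg", "2.1kg", "4kgs"])

# "kgs?$" matches "kg" or "kgs" at the end of the string.
# regex=True is passed explicitly so the pattern is treated as a regex.
weight_kg = weights.str.replace(r"kgs?$", "", regex=True).astype(float)

print(weight_kg.tolist())
```

With the two chained literal replacements, the order matters: replacing "kg" first would leave a stray "s" behind on values that end in "kgs".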