<a href="https://colab.research.google.com/github/chrismarkella/Kaggle-access-from-Google-Colab/blob/master/read_selected_columns_from_CSV.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Read only selected columns from a CSV file.
Let's say we have a CSV file with hundreds of columns.

We need only a few dozen of these columns to work with.

We will see how to select and read only the needed columns out of all the columns in the CSV file.

We will use the `mercedesbenz.csv` file from `Kaggle` to demonstrate this.

###Setup

In [0]:
import os

import numpy as np
import pandas as pd

from getpass import getpass 

In [2]:
def access_kaggle():
    """
    Access Kaggle from Google Colab.
    If the /root/.kaggle does not exist then prompt for
    the username and for the Kaggle API key.
    Creates the kaggle.json access file in the /root/.kaggle/ folder. 
    """
    KAGGLE_ROOT = os.path.join('/root', '.kaggle')
    KAGGLE_PATH = os.path.join(KAGGLE_ROOT, 'kaggle.json')

    if '.kaggle' not in os.listdir(path='/root'):
        user = getpass(prompt='Kaggle username: ')
        key  = getpass(prompt='Kaggle API key: ')
        
        !mkdir $KAGGLE_ROOT
        !touch $KAGGLE_PATH
        !chmod 666 $KAGGLE_PATH
        with open(KAGGLE_PATH, mode='w') as f:
            # The with-block closes the file on exit, so no explicit close() is needed.
            f.write('{"username":"%s", "key":"%s"}' %(user, key))
        !chmod 600 $KAGGLE_PATH
        del user
        del key
        success_msg = "Kaggle is successfully set up. Good to go."
        print(f'{success_msg}')

access_kaggle()


Kaggle username: ··········
Kaggle API key: ··········
Kaggle is successfully set up. Good to go.
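
The credential file can also be written with the standard library alone, which avoids the shell commands and lets `json.dump` handle the quoting. The sketch below uses a hypothetical helper, `write_kaggle_credentials`, which is not part of the Kaggle API. It writes into a temporary directory with placeholder credentials so that it runs anywhere; in Colab the folder would be `/root/.kaggle`.

```python
import json
import os
import stat
import tempfile


def write_kaggle_credentials(kaggle_root, user, key):
    """Hypothetical helper: write kaggle.json under kaggle_root, accessible only by the owner."""
    os.makedirs(kaggle_root, exist_ok=True)
    kaggle_path = os.path.join(kaggle_root, 'kaggle.json')
    with open(kaggle_path, mode='w') as f:
        # json.dump escapes quotes and backslashes that plain string formatting would not.
        json.dump({'username': user, 'key': key}, f)
    # Same effect as `chmod 600`: read and write for the owner only.
    os.chmod(kaggle_path, stat.S_IRUSR | stat.S_IWUSR)
    return kaggle_path


# Demonstration with placeholder credentials in a throwaway directory.
demo_root = os.path.join(tempfile.mkdtemp(), '.kaggle')
demo_path = write_kaggle_credentials(demo_root, 'demo_user', 'demo"key')
with open(demo_path) as f:
    print(json.load(f))
```

The placeholder key contains a double quote on purpose: it would produce invalid JSON with `%`-formatting, but it round-trips cleanly here.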


In [3]:
!kaggle datasets download yogeerp/mercedes --unzip

Downloading mercedes.zip to /content
  0% 0.00/293k [00:00<?, ?B/s]
100% 293k/293k [00:00<00:00, 43.0MB/s]


In [4]:
ls -lh

total 3.1M
-rw-r--r-- 1 root root 3.1M Jan 15 18:51 mercedesbenz.csv
drwxr-xr-x 1 root root 4.0K Jan 13 16:38 sample_data/


In [5]:
!cat mercedesbenz.csv|head -3

ID,y,X0,X1,X2,X3,X4,X5,X6,X8,X10,X11,X12,X13,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23,X24,X26,X27,X28,X29,X30,X31,X32,X33,X34,X35,X36,X37,X38,X39,X40,X41,X42,X43,X44,X45,X46,X47,X48,X49,X50,X51,X52,X53,X54,X55,X56,X57,X58,X59,X60,X61,X62,X63,X64,X65,X66,X67,X68,X69,X70,X71,X73,X74,X75,X76,X77,X78,X79,X80,X81,X82,X83,X84,X85,X86,X87,X88,X89,X90,X91,X92,X93,X94,X95,X96,X97,X98,X99,X100,X101,X102,X103,X104,X105,X106,X107,X108,X109,X110,X111,X112,X113,X114,X115,X116,X117,X118,X119,X120,X122,X123,X124,X125,X126,X127,X128,X129,X130,X131,X132,X133,X134,X135,X136,X137,X138,X139,X140,X141,X142,X143,X144,X145,X146,X147,X148,X150,X151,X152,X153,X154,X155,X156,X157,X158,X159,X160,X161,X162,X163,X164,X165,X166,X167,X168,X169,X170,X171,X172,X173,X174,X175,X176,X177,X178,X179,X180,X181,X182,X183,X184,X185,X186,X187,X189,X190,X191,X192,X194,X195,X196,X197,X198,X199,X200,X201,X202,X203,X204,X205,X206,X207,X208,X209,X210,X211,X212,X213,X214,X215,X216,X217,X218,X219,X220,X221,X222,X223,X224,X225,X226,X227

###This CSV file has `378 columns`; the `length=378` in the `df.columns` output below confirms the count.

In [6]:
csv_file_name = 'mercedesbenz.csv'
df = pd.read_csv(csv_file_name, sep=',')
df.columns

Index(['ID', 'y', 'X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8',
       ...
       'X375', 'X376', 'X377', 'X378', 'X379', 'X380', 'X382', 'X383', 'X384',
       'X385'],
      dtype='object', length=378)
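
Counting the columns does not require loading the whole file; reading the header row is enough. The sketch below uses the standard `csv` module on a small in-memory sample. The helper name `read_header` and the sample values are made up for illustration. With the real file, the same function could be given `open('mercedesbenz.csv', newline='')`.

```python
import csv
import io


def read_header(file_obj):
    """Return the first row of a CSV file object as a list of column names."""
    reader = csv.reader(file_obj)
    return next(reader)


# Made-up sample: a header shaped like the real one, only much shorter.
sample = io.StringIO("ID,y,X0,X1,X2\n0,130.81,k,v,at\n")
header = read_header(sample)
print(len(header), header)
```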

In [7]:
df.columns[:50]

Index(['ID', 'y', 'X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8', 'X10', 'X11',
       'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X21',
       'X22', 'X23', 'X24', 'X26', 'X27', 'X28', 'X29', 'X30', 'X31', 'X32',
       'X33', 'X34', 'X35', 'X36', 'X37', 'X38', 'X39', 'X40', 'X41', 'X42',
       'X43', 'X44', 'X45', 'X46', 'X47', 'X48', 'X49', 'X50'],
      dtype='object')

###Task
- Read only column `y` and columns `X1` through `X30` from the CSV file.
- The column numbers run from `1` to `30`, except `7`, `9` and `25`, which do not exist in the file.
- Use the `usecols` parameter of the `read_csv` function to read only the selected columns.
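
Before turning to the real file, here is a minimal sketch of `usecols` on a tiny in-memory CSV. The sample data and variable names are assumptions made for illustration. `usecols` accepts either a list of column names or a callable that is evaluated against each column name.

```python
import io

import pandas as pd

# Made-up miniature stand-in for the real CSV file.
csv_text = (
    "ID,y,X0,X1,X2\n"
    "0,1.5,a,b,c\n"
    "1,2.5,d,e,f\n"
)

# List form: only the named columns are parsed.
by_name = pd.read_csv(io.StringIO(csv_text), usecols=['X2', 'y'])

# Callable form: keep every column for which the function returns True.
by_rule = pd.read_csv(io.StringIO(csv_text), usecols=lambda name: name != 'ID')

print(list(by_name.columns))
print(list(by_rule.columns))
```

The resulting columns keep the order they have in the file, not the order of the list passed to `usecols`. To get a specific order, reindex afterwards, for example `by_name[['X2', 'y']]`.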

####Approach 1:
- Use `range` to generate the numbers and skip `7`, `9` and `25`.
- Use an `f-string` to generate the column names.

In [8]:
column_numbers = range(1,30+1)
numbers_to_skip = [7,9,25]
columns_to_read = [f'X{i}' for i in column_numbers if i not in numbers_to_skip]
columns_to_read.insert(0, 'y')
columns_to_read

['y',
 'X1',
 'X2',
 'X3',
 'X4',
 'X5',
 'X6',
 'X8',
 'X10',
 'X11',
 'X12',
 'X13',
 'X14',
 'X15',
 'X16',
 'X17',
 'X18',
 'X19',
 'X20',
 'X21',
 'X22',
 'X23',
 'X24',
 'X26',
 'X27',
 'X28',
 'X29',
 'X30']

In [9]:
df = pd.read_csv(csv_file_name, sep=',',
            usecols=columns_to_read)
df.head(3)

,y,X1,X2,X3,X4,X5,X6,X8,X10,X11,X12,X13,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23,X24,X26,X27,X28,X29,X30
0,130.81,v,at,a,d,u,j,o,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0
1,88.53,t,av,e,d,y,l,o,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0
2,76.26,w,n,c,d,x,j,x,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,1,0


In [10]:
df.columns

Index(['y', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8', 'X10', 'X11', 'X12',
       'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X21', 'X22',
       'X23', 'X24', 'X26', 'X27', 'X28', 'X29', 'X30'],
      dtype='object')
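
The steps of Approach 1 can be wrapped in a small reusable function. This is only a sketch: the name `make_column_names` and its parameters are hypothetical, chosen for illustration.

```python
def make_column_names(first, last, skip=(), prefix='X', extra=('y',)):
    """Return the extra columns followed by prefix<first>..prefix<last>, leaving out skipped numbers."""
    skipped = set(skip)  # set membership tests are fast and ignore duplicates
    numbered = [f'{prefix}{i}' for i in range(first, last + 1) if i not in skipped]
    return list(extra) + numbered


columns = make_column_names(1, 30, skip=[7, 9, 25])
print(len(columns), columns[:3], columns[-1])
```

The returned list can be passed directly as `usecols=columns`.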

####Approach 2:
- Concatenate ranges using `chain` from `itertools`.
- Use an `f-string` again to generate the column names.

In [11]:
from itertools import chain

column_numbers_ranges = chain(
    range(1,6+1),
    range(8,8+1),
    range(10,24+1),
    range(26,30+1)
)
columns_to_read_ranges = [f'X{i}' for i in column_numbers_ranges]
columns_to_read_ranges.insert(0,'y')
columns_to_read_ranges

['y',
 'X1',
 'X2',
 'X3',
 'X4',
 'X5',
 'X6',
 'X8',
 'X10',
 'X11',
 'X12',
 'X13',
 'X14',
 'X15',
 'X16',
 'X17',
 'X18',
 'X19',
 'X20',
 'X21',
 'X22',
 'X23',
 'X24',
 'X26',
 'X27',
 'X28',
 'X29',
 'X30']

In [12]:
df2 = pd.read_csv(csv_file_name, sep=',',
                  usecols=columns_to_read_ranges)
df2.head(2)

,y,X1,X2,X3,X4,X5,X6,X8,X10,X11,X12,X13,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23,X24,X26,X27,X28,X29,X30
0,130.81,v,at,a,d,u,j,o,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0
1,88.53,t,av,e,d,y,l,o,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0


In [13]:
df2.columns

Index(['y', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8', 'X10', 'X11', 'X12',
       'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X21', 'X22',
       'X23', 'X24', 'X26', 'X27', 'X28', 'X29', 'X30'],
      dtype='object')
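
As a quick sanity check, the two approaches can be compared directly. The sketch below rebuilds both number sequences without touching the CSV file; the variable names are chosen for illustration.

```python
from itertools import chain

# Approach 1: one range with an explicit skip list.
skip_based = [i for i in range(1, 30 + 1) if i not in (7, 9, 25)]

# Approach 2: several contiguous ranges chained together.
range_based = list(chain(
    range(1, 6 + 1),
    range(8, 8 + 1),
    range(10, 24 + 1),
    range(26, 30 + 1),
))

same = skip_based == range_based
print(same, len(range_based))
```

The skip list tends to be shorter when only a few numbers are missing. Chained ranges read more naturally when the wanted columns fall into a few separate blocks.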