# Gathering data using csv

In [1]:
# first we need to import pandas 
import pandas as pd

### `opening a csv file using read_csv`

In [5]:
data = pd.read_csv("tmdb_5000_credits.csv") # it datafile name.csv only works when the this file(ipynb) and data file should be in same folder

### `opening a csv file from an url`

In [18]:
import requests
from io import StringIO
import pandas as pd

url = "https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/115.0.0.0 Safari/537.36"
}

req = requests.get(url=url, headers=headers)

# Convert CSV text into a file-like object
random = StringIO(req.text)

# Read into DataFrame
df = pd.read_csv(random)


### `sep parameter` :
```bash 
it will be useful when we opening files like tab(\t) seperator .tsv files.
this you have to specify the `sep` parameter in the `read_csv()` function
```

In [10]:
data = pd.read_csv("tmdb_5000_credits.csv",sep=",")
data.head()

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


### `names parameter` :
```bash
used to specifies your own column names when the columns names are not existed in dataset
```

In [19]:
df = pd.read_csv(random,names=['Month',"Year 1958","Year 1959","Year 1960"])
df.head()

Unnamed: 0,Month,Year 1958,Year 1959,Year 1960


### `index_col parameter` :
```bash
used when you want to make the any column as index column
```

In [21]:
data = pd.read_csv("tmdb_5000_credits.csv",sep=",",index_col=0)
data.head() # in this case movie_id is a index column

Unnamed: 0_level_0,title,cast,crew
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


### `header parameter` :
```bash
 when you want to use any row  as  column names
```

In [26]:
data = pd.read_csv("tmdb_5000_credits.csv",sep=",",header=0)
data.head() # in this case movie_id is a index column

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


### `use_cols parameter` :
```bash
 when you want to load only few columns 
```

In [27]:
data = pd.read_csv("tmdb_5000_credits.csv",sep=",",usecols=["movie_id","title"])
data.head()

Unnamed: 0,movie_id,title
0,19995,Avatar
1,285,Pirates of the Caribbean: At World's End
2,206647,Spectre
3,49026,The Dark Knight Rises
4,49529,John Carter


### `Skiprows/nrows parameter` :
```bash
 when you want to skip some rows 
```

In [33]:
data = pd.read_csv("tmdb_5000_credits.csv",skiprows=[1,3,5])
data.head()

Unnamed: 0,movie_id,title,cast,crew
0,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
1,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
2,559,Spider-Man 3,"[{""cast_id"": 30, ""character"": ""Peter Parker / ...","[{""credit_id"": ""52fe4252c3a36847f80151a5"", ""de..."
3,38757,Tangled,"[{""cast_id"": 34, ""character"": ""Flynn Rider (vo...","[{""credit_id"": ""52fe46db9251416c91062101"", ""de..."
4,99861,Avengers: Age of Ultron,"[{""cast_id"": 76, ""character"": ""Tony Stark / Ir...","[{""credit_id"": ""55d5f7d4c3a3683e7e0016eb"", ""de..."


### `nrows parameter` :
```bash
 when you want to limit the number of rows 
```

In [3]:
data = pd.read_csv("tmdb_5000_credits.csv")
data

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."
...,...,...,...,...
4798,9367,El Mariachi,"[{""cast_id"": 1, ""character"": ""El Mariachi"", ""c...","[{""credit_id"": ""52fe44eec3a36847f80b280b"", ""de..."
4799,72766,Newlyweds,"[{""cast_id"": 1, ""character"": ""Buzzy"", ""credit_...","[{""credit_id"": ""52fe487dc3a368484e0fb013"", ""de..."
4800,231617,"Signed, Sealed, Delivered","[{""cast_id"": 8, ""character"": ""Oliver O\u2019To...","[{""credit_id"": ""52fe4df3c3a36847f8275ecf"", ""de..."
4801,126186,Shanghai Calling,"[{""cast_id"": 3, ""character"": ""Sam"", ""credit_id...","[{""credit_id"": ""52fe4ad9c3a368484e16a36b"", ""de..."


### `encoding parameter` :
```bash
 some times we get the unicode error while loading the file then it will help more
 example : 
    data = pd.read_csv("tmdb_5000_credits.csv",encoding='latin-1')
```

### `na_values parameter` :
```bash
 some times in our data contains random symbols (like - , etc) as values so used to convert those to NAN values
 example : 
    data = pd.read_csv("tmdb_5000_credits.csv",na_values=['-',"@"])
```

### `parse_dates parameter` :
```bash
 is used to convert the string dates to actual date format
 example : 
    data = pd.read_csv("tmdb_5000_credits.csv",parse_dates=[date_column_name]})
```

### `convertor parameter` :
```bash
 is used to convert the any string into another like "Sunrisers Hyderabad" to "SRH"
example : 
    def custom_function(value) : 
        if value == "Sunrisers Hyderabad" : 
            return "SRH"
        else : 
            return value
    data = pd.read_csv("tmdb_5000_credits.csv",converters={column_name : custom function})
```

### `chunksize parameter` :
```bash
it is helpful when your dataset contains large number of rows then you can divide the dataset into chunks with minimum number of rows
example :  
     data = pd.read_csv("tmdb_5000_credits.csv",chunksize=5000) # 5000 rows

     for chunk in data : 
         print(chunk.shape)
```