### 1. **Introduction to Sqoop**

- **Purpose**: Sqoop is primarily used to import data from relational databases like MySQL, Oracle, PostgreSQL, and SQL Server into Hadoop's HDFS (Hadoop Distributed File System), Hive, or HBase. It can also export data from Hadoop back to these databases.
- **Name Origin**: Sqoop stands for "SQL-to-Hadoop."

### 2. **Features of Sqoop**

- **Full Load and Incremental Load**: Supports both full data loading and incremental loading (importing only rows added or modified since the previous run).
- **Parallel Import/Export**: Uses MapReduce to import/export data, allowing parallel operations for high performance.
- **Data Compression**: Supports compression codecs (e.g., gzip, bzip2) to reduce storage space and the volume of data written during transfer.
- **Data Format Support**: Supports various data formats, including text files, Avro, SequenceFiles, and Parquet.
- **Integration with Hive/HBase**: Directly imports data into Hive tables or HBase.
- **Security**: Supports Kerberos authentication and secure password storage.
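
To make the incremental-load idea concrete, here is a small Python sketch of an append-style incremental load: keep only the rows whose check column is greater than the last imported value, then record the new high-water mark for the next run. The function name `select_new_rows` and the sample rows are hypothetical; the sketch mimics the selection rule on an in-memory list and is not Sqoop code.

```python
def select_new_rows(rows, check_column, last_value):
    """Return rows whose check column exceeds last_value, plus the new high-water mark."""
    new_rows = [row for row in rows if row[check_column] > last_value]
    # If nothing new arrived, the high-water mark stays where it was.
    new_last_value = max((row[check_column] for row in new_rows), default=last_value)
    return new_rows, new_last_value


# Hypothetical source table: ids 99..102, of which everything up to 100 was already imported.
source_rows = [
    {"id": 99, "name": "a"},
    {"id": 100, "name": "b"},
    {"id": 101, "name": "c"},
    {"id": 102, "name": "d"},
]

new_rows, new_last_value = select_new_rows(source_rows, "id", 100)
print([row["id"] for row in new_rows])  # [101, 102]
print(new_last_value)                   # 102
```

The returned high-water mark is the value that would be passed as the starting point of the next incremental run.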

### 3. **Sqoop Architecture**

Sqoop operates in the following steps:

1. **Client-Side Commands**: Users issue Sqoop commands from the client-side interface.
2. **Sqoop Driver**: The driver interprets these commands and manages the import/export jobs.
3. **MapReduce Job**: Sqoop generates MapReduce jobs for importing/exporting data. 
4. **Data Source Connector**: Sqoop uses a connector (JDBC driver) specific to the database to interact with the source or target database.
5. **HDFS/Hive/HBase**: Data is transferred between the source/target database and Hadoop storage components like HDFS, Hive, or HBase.
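
The parallelism in step 3 comes from dividing the source table into ranges of a split column (by default the table's primary key) and giving each mapper one range. The sketch below illustrates that idea for integer keys only. The function name `split_ranges` and the column name `id` are assumptions for illustration, and Sqoop's real splitter handles other column types and boundary details that are not modelled here.

```python
def split_ranges(min_value, max_value, num_mappers):
    """Divide the integer keys min_value..max_value into contiguous half-open ranges."""
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    span = max_value - min_value + 1  # number of integer keys to cover
    base, extra = divmod(span, num_mappers)
    ranges = []
    lo = min_value
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first ranges
        hi = lo + size
        if size > 0:
            ranges.append((lo, hi))  # covers lo <= key < hi
        lo = hi
    return ranges


# Each range becomes the WHERE clause of one mapper's query.
for lo, hi in split_ranges(1, 1000, 4):
    print(f"SELECT * FROM table_name WHERE id >= {lo} AND id < {hi}")
```

With a single mapper the whole key range is read by one query, which gives up parallelism but avoids needing a split column.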

### 4. **Common Sqoop Commands**

- **Import Data from RDBMS to HDFS**:
  ```bash
  sqoop import --connect jdbc:mysql://localhost:3306/database_name --username root --password password --table table_name --target-dir /user/hdfs/target_dir
  ```
  - `--connect`: JDBC connection string for the database.
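  - `--username` and `--password`: Database credentials. Passing `--password` on the command line exposes it in shell history and process listings; Sqoop also accepts `-P` (prompt on the console) or `--password-file`.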
  - `--table`: Name of the table to import.
  - `--target-dir`: HDFS directory where data will be stored.

- **Import Data from RDBMS to Hive**:
  ```bash
  sqoop import --connect jdbc:mysql://localhost:3306/database_name --username root --password password --table table_name --hive-import --create-hive-table --hive-table hive_db.hive_table
  ```
  - `--hive-import`: Directly imports data into a Hive table.
  - `--create-hive-table`: Causes the job to fail if the target Hive table already exists; omit it to load into an existing table.
  - `--hive-table`: Specifies the target Hive table.

- **Import Data from RDBMS to HBase**:
  ```bash
  sqoop import --connect jdbc:mysql://localhost:3306/database_name --username root --password password --table table_name --hbase-table hbase_table --column-family column_family
  ```
  - `--hbase-table`: Name of the HBase table where data will be imported.
  - `--column-family`: Column family in HBase.

- **Incremental Import**:
  ```bash
  sqoop import --connect jdbc:mysql://localhost:3306/database_name --username root --password password --table table_name --incremental append --check-column id --last-value 100
  ```
  - `--incremental append`: Performs incremental import by appending new data.
  - `--check-column`: Column used to check for new data.
  - `--last-value`: Last imported value for the `--check-column`.

- **Export Data from HDFS to RDBMS**:
  ```bash
  sqoop export --connect jdbc:mysql://localhost:3306/database_name --username root --password password --table table_name --export-dir /user/hdfs/target_dir
  ```
  - `--export-dir`: HDFS directory containing data to export.
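
When these commands are launched from a script or notebook, it can help to assemble the arguments as a list instead of a single string. The following is a minimal sketch under that assumption: `build_sqoop_import` is a hypothetical helper, not part of Sqoop, and it uses Sqoop's `-P` and `--password-file` options (which do not appear in the commands above) so that the password is not embedded in the command line.

```python
import shlex


def build_sqoop_import(jdbc_url, username, table, target_dir,
                       password_file=None, num_mappers=None):
    """Assemble the argument list for a basic `sqoop import` into HDFS."""
    args = ["sqoop", "import", "--connect", jdbc_url, "--username", username]
    if password_file is not None:
        args += ["--password-file", password_file]  # password read from a file
    else:
        args.append("-P")  # prompt for the password on the console
    args += ["--table", table, "--target-dir", target_dir]
    if num_mappers is not None:
        args += ["-m", str(num_mappers)]
    return args


args = build_sqoop_import(
    "jdbc:mysql://localhost:3306/database_name",
    "root",
    "table_name",
    "/user/hdfs/target_dir",
    num_mappers=1,
)
print(shlex.join(args))  # shell-quoted form, handy for logging

# On a machine with Sqoop installed, the list can be executed directly:
# import subprocess
# subprocess.run(args, check=True)
```

Passing a list to `subprocess.run` avoids shell interpretation of characters such as `$` in the arguments.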

### 5. **Sqoop Connectors**

- **JDBC Connector**: Most databases support JDBC (Java Database Connectivity), which Sqoop uses to connect to various databases.
- **Specific Connectors**: Sqoop also ships connectors optimized for particular databases, such as MySQL, PostgreSQL, and Oracle.
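
The database vendor is visible in the JDBC connection string itself, in the segment after `jdbc:`. As a rough illustration, the hypothetical helper `jdbc_vendor` below extracts that segment so a script can check which database a `--connect` string targets. It is only a string-parsing sketch and does not reproduce how Sqoop itself chooses a connector.

```python
def jdbc_vendor(jdbc_url):
    """Return the vendor segment of a JDBC URL, e.g. 'mysql' for 'jdbc:mysql://host/db'."""
    parts = jdbc_url.split(":", 2)
    if len(parts) < 3 or parts[0] != "jdbc" or not parts[1]:
        raise ValueError(f"not a JDBC URL: {jdbc_url!r}")
    return parts[1].lower()


print(jdbc_vendor("jdbc:mysql://localhost:3306/database_name"))  # mysql
print(jdbc_vendor("jdbc:postgresql://localhost:5432/database_name"))  # postgresql
```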

```python
import pandas as pd
from sqlalchemy import create_engine

# Create a sample employee DataFrame
data = {
    'employee_id': [101, 102, 103, 104],
    'first_name': ['John', 'Jane', 'Mike', 'Sara'],
    'last_name': ['Doe', 'Smith', 'Johnson', 'Brown'],
    'age': [28, 34, 29, 42],
    'department': ['Sales', 'HR', 'Engineering', 'Marketing'],
    'salary': [50000, 60000, 75000, 85000]
}

employee_df = pd.DataFrame(data)
print("Sample Employee Data:")
print(employee_df)

# Define the database connection parameters
db_user = 'maneelcha49dgre'
db_password = 'BrownGorilla19$'
db_host = 'master'  # or your MySQL server address
db_port = '3306'    # default MySQL port
db_name = 'maneelcha49dgre'

# Create the SQLAlchemy engine (requires the mysql-connector-python driver)
engine = create_engine(f'mysql+mysqlconnector://{db_user}:{db_password}@{db_host}:{db_port}/{db_name}')

# Write the DataFrame to the MySQL database; Sqoop will later import this table into Hive
table_name = 'empTable'

try:
    # if_exists='replace' drops and recreates the table if it already exists
    employee_df.to_sql(name=table_name, con=engine, if_exists='replace', index=False)
    print(f"Data successfully inserted into the '{table_name}' table.")
except Exception as e:
    print(f"An error occurred: {e}")
```


```
Sample Employee Data:
   employee_id first_name last_name  age   department  salary
0          101       John       Doe   28        Sales   50000
1          102       Jane     Smith   34           HR   60000
2          103       Mike   Johnson   29  Engineering   75000
3          104       Sara     Brown   42    Marketing   85000
Data successfully inserted into the 'empTable' table.
```


```python
pd.read_sql('select * from empTable', engine)
```

```
   employee_id first_name last_name  age   department  salary
0          101       John       Doe   28        Sales   50000
1          102       Jane     Smith   34           HR   60000
2          103       Mike   Johnson   29  Engineering   75000
3          104       Sara     Brown   42    Marketing   85000
```


Execute this command in the web shell, changing the username, password, and table name to match your own setup:

```bash
sqoop import --connect jdbc:mysql://master:3306/ankit810248bgre --username ankit810248bgre --password SilverOwl38$ --table employees_1 --hive-import --create-hive-table --hive-table ankit.employees_1 -m 1
```