Skip to content
Permalink
Browse files

Updated to version 2.1.1

  • Loading branch information...
rcalandrelli committed Jul 25, 2019
1 parent c5cf649 commit 6ab9870b3216e3a11c7dde3c64bd2792d79d74a5
BIN +0 Bytes (100%) .DS_Store
Binary file not shown.
@@ -1,4 +1,4 @@
# HiCtool: a standardized pipeline to process and visualize Hi-C data (v2.1)
# HiCtool: a standardized pipeline to process and visualize Hi-C data (v2.1.1)

HiCtool is an open-source bioinformatic tool based on Python, which integrates several software to perform a standardized Hi-C data analysis, from the processing of raw data, to the visualization of heatmaps and the identification of topologically associated domains (TADs).

@@ -7,8 +7,8 @@ HiCtool is an open-source bioinformatic tool based on Python, which integrates s
- [Overview](#overview)
- [Installation](#installation)
- [Tutorial](#tutorial)
- [Version history](#version-history)
- [Reference](#reference)
- [Version history](#version-history)
- [Support](#support)

## Overview
@@ -59,8 +59,27 @@ HiCtool is in a pipeline format based on single unix commands to allow easy usag

We have compiled a full tutorial to show the usage of the pipeline. Please check the [Tutorial Homepage](./tutorial/ReadMe.md).

## Reference

HiCtool was developed by Riccardo Calandrelli and Qiuyang Wu from Dr. Sheng Zhong's Lab at University of California, San Diego.

If you use HiCtool, please cite the paper:

Calandrelli, R., Wu, Q., Guan, J., & Zhong, S. (2018). GITAR: An open source tool for analysis and visualization of Hi-C data. *Genomics, proteomics & bioinformatics.*

## Version history

### July 25, 2019

- Version 2.1.1 released:

- Optimized pre-processing and normalization to allow the analysis of very high depth sequenced files.
- Added parameter to allow custom mapped reads filtering at the pre-processing step.
- Bedpe file generation step added for pre-processed read pairs.
- Deduplication step added in the pre-processing pipeline.
- Optimized and faster FEND file generation for adding GC content and mappability score.
- Small bug fixes.

### May 20, 2019

- Version 2.1 released:
@@ -88,14 +107,6 @@ We have compiled a full tutorial to show the usage of the pipeline. Please check

- The initial release of HiCtool v1.0 came out in late 2015. GITAR manuscript (including HiCtool) published in October 2018.

## Reference

HiCtool was developed by Riccardo Calandrelli and Qiuyang Wu from Dr. Sheng Zhong's Lab at University of California, San Diego.

If you use HiCtool, please cite the paper:

Calandrelli, R., Wu, Q., Guan, J., & Zhong, S. (2018). GITAR: An open source tool for analysis and visualization of Hi-C data. *Genomics, proteomics & bioinformatics.*

## Support

For issues related to the use of HiCtool or if you want to report a bug, please contact Riccardo Calandrelli at <rcalandrelli@eng.ucsd.edu>.
BIN +0 Bytes (100%) scripts/.DS_Store
Binary file not shown.
@@ -804,7 +804,7 @@ def full_tad_analysis(input_contact_matrix,
species=species,
bin_size=bin_size,
data_type=data_type,
save_output=True,
save_output=False,
save_tab=False)

# DI VALUES
@@ -26,55 +26,34 @@
'gc_files_path': None,
'threads': None}


def add_gc_content(chromosome):
output_path = parameters['restrictionsites_path']
gc_files_path = parameters['gc_files_path']

content = []
with open(output_path + 'chr' + chromosome + '.bed') as f:
for line in f:
content.append(line.strip().split())
content = np.array(content)

content[:,4] = np.arange(content.shape[0])
pos = np.int_(((np.int_(content[:,1]))+3)/5)*5

# Building a bed file from restrictionsites with start coordinate as 200 bp upstream of each position and end coordinate as position.
content[:,1] = pos-200
content[:,2] = pos
content[np.int_(content[:,1]) <= 0,1] = 0

with open(output_path + 'chr' + chromosome + '_upstream.bed', 'w') as f:
for x in content.tolist():
f.write("%s\n" % '\t'.join(x))

# Building a bed file from restrictionsites with start coordinate as position and end coordinate as 200 bp downstream of each position.
content[:,1] = pos
content[:,2] = pos+200

with open(output_path + 'chr' + chromosome + '_downstream.bed', 'w') as f:
for x in content.tolist():
f.write("%s\n" % '\t'.join(x))
upstream = pd.read_csv(output_path + 'chr' + chromosome + '.bed', sep = '\t', header = None)
upstream.iloc[:,4] = range(upstream.shape[0])
upstream.iloc[:,2] = upstream.iloc[:,1]
upstream.iloc[:,1] = upstream.iloc[:,1] - 200
upstream.loc[upstream.loc[:,1] < 0, 1] = 0
a_up = pybedtools.BedTool.from_dataframe(upstream)

a_up = pybedtools.example_bedtool(output_path + 'chr' + chromosome + '_upstream.bed')
a_down = pybedtools.example_bedtool(output_path + 'chr' + chromosome + '_downstream.bed')
# Building a bed file from restrictionsites with start coordinate as position and end coordinate as 200 bp downstream of each position.
downstream = pd.read_csv(output_path + 'chr' + chromosome + '.bed', sep = '\t', header = None)
downstream.iloc[:,4] = range(downstream.shape[0])
downstream.iloc[:,1] = downstream.iloc[:,2]
downstream.iloc[:,2] = downstream.iloc[:,2] + 200
a_down = pybedtools.BedTool.from_dataframe(downstream)

result = pd.read_table(output_path + 'chr' + chromosome + '.bed', header=None)
result = pd.read_csv(output_path + 'chr' + chromosome + '.bed', header=None, sep='\t')
result.columns = ['chr', 'start','stop','name','score','strand']
result['gc_upstream'] = '' # adding a field to store the upstream GC content
result['gc_downstream'] = '' # adding a field to store the upstream GC content

b = pybedtools.example_bedtool(gc_files_path + 'chr' + chromosome + '.txt')

aup_and_b = a_up.intersect(b,wa=True,wb=True)
adown_and_b = a_down.intersect(b,wa=True,wb=True)

aup_and_b.saveas(output_path + 'upstream_intersect_chr' + chromosome + '.bed')
adown_and_b.saveas(output_path + 'downstream_intersect_chr' + chromosome + '.bed')

aup_and_b = pd.read_table(output_path + 'upstream_intersect_chr' + chromosome + '.bed', header=None)
adown_and_b = pd.read_table(output_path + 'downstream_intersect_chr' + chromosome + '.bed', header=None)
aup_and_b = a_up.intersect(b,wa=True,wb=True).to_dataframe()
adown_and_b = a_down.intersect(b,wa=True,wb=True).to_dataframe()

aup_and_b.columns = ['chr','start','stop','name','index','strand','chr_gc','start_gc','stop_gc','gc_up']
adown_and_b.columns = ['chr','start','stop','name','index','strand','chr_gc','start_gc','stop_gc','gc_down']
@@ -89,21 +68,17 @@ def add_gc_content(chromosome):

r1 = 0
for r in np.array(df.ind):
if result.ix[r,'strand'] == '+':
result.ix[r,'gc_upstream'] = np.array(df.ix[r1,'gc_up']).astype(np.str)
result.ix[r,'gc_downstream'] = np.array(df.ix[r1,'gc_down']).astype(np.str)
if result.ix[r,'strand'] == '-':
result.ix[r,'gc_upstream'] = np.array(df.ix[r1,'gc_down']).astype(np.str)
result.ix[r,'gc_downstream'] = np.array(df.ix[r1,'gc_up']).astype(np.str)
if result.loc[r,'strand'] == '+':
result.loc[r,'gc_upstream'] = np.array(df.loc[r1,'gc_up']).astype(np.str)
result.loc[r,'gc_downstream'] = np.array(df.loc[r1,'gc_down']).astype(np.str)
if result.loc[r,'strand'] == '-':
result.loc[r,'gc_upstream'] = np.array(df.loc[r1,'gc_down']).astype(np.str)
result.loc[r,'gc_downstream'] = np.array(df.loc[r1,'gc_up']).astype(np.str)
r1 += 1

result.score = 1
result.to_csv(path_or_buf=output_path + 'chr' + chromosome + '_gc.bed',sep='\t',header=False,index=False)

os.remove(output_path + 'chr' + chromosome + '_upstream.bed')
os.remove(output_path + 'chr' + chromosome + '_downstream.bed')
os.remove(output_path + 'upstream_intersect_chr' + chromosome + '.bed')
os.remove(output_path + 'downstream_intersect_chr' + chromosome + '.bed')
print "GC content information for chr" + chromosome + " complete."


@@ -24,8 +24,7 @@
'species': None,
'restrictionsites_path': None,
'artificial_reads_path': None,
'threads': None}

'threads': None}

def add_mappability_score(chromosome):
"""
@@ -42,34 +41,20 @@ def add_mappability_score(chromosome):
output_path = parameters['restrictionsites_path']
artificial_reads_path = parameters['artificial_reads_path']

content = []
with open(output_path + 'chr' + chromosome + '.bed') as f:
for line in f:
content.append(line.strip().split())
content = np.array(content)

content[:,4] = np.arange(content.shape[0])
pos = np.int_(((np.int_(content[:,1]))+3)/5)*5

# Building a bed file from restrictionsites with start coordinate as 500 bp upstream of each position and end coordinate as position.
content[:,1] = pos-500
content[:,2] = pos
content[np.int_(content[:,1]) <= 0,1] = 0

with open(output_path + 'chr' + chromosome + '_upstream.bed', 'w') as f:
for x in content.tolist():
f.write("%s\n" % '\t'.join(x))
upstream = pd.read_csv(output_path + 'chr' + chromosome + '.bed', sep = '\t', header = None)
upstream.iloc[:,4] = range(upstream.shape[0])
upstream.iloc[:,2] = upstream.iloc[:,1]
upstream.iloc[:,1] = upstream.iloc[:,1] - 500
upstream.loc[upstream.loc[:,1] < 0, 1] = 0
a_up = pybedtools.BedTool.from_dataframe(upstream)

# Building a bed file from restrictionsites with start coordinate as position and end coordinate as 500 bp downstream of each position.
content[:,1] = pos
content[:,2] = pos+500

with open(output_path + 'chr' + chromosome + '_downstream.bed', 'w') as f:
for x in content.tolist():
f.write("%s\n" % '\t'.join(x))

a_up = pybedtools.example_bedtool(output_path + 'chr' + chromosome + '_upstream.bed')
a_down = pybedtools.example_bedtool(output_path + 'chr' + chromosome + '_downstream.bed')
downstream = pd.read_csv(output_path + 'chr' + chromosome + '.bed', sep = '\t', header = None)
downstream.iloc[:,4] = range(downstream.shape[0])
downstream.iloc[:,1] = downstream.iloc[:,2]
downstream.iloc[:,2] = downstream.iloc[:,2] + 500
a_down = pybedtools.BedTool.from_dataframe(downstream)

result = pd.read_table(output_path + 'chr' + chromosome + '.bed',header=None)
result.columns = ['chr','start','stop','name','score','strand']
@@ -78,14 +63,8 @@ def add_mappability_score(chromosome):

b = pybedtools.example_bedtool(artificial_reads_path + 'chr' + chromosome + '.txt')

aup_and_b = a_up.intersect(b,wa=True,wb=True)
adown_and_b = a_down.intersect(b,wa=True,wb=True)

aup_and_b.saveas(output_path + 'upstream_intersect_chr' + chromosome + '.bed')
adown_and_b.saveas(output_path + 'downstream_intersect_chr' + chromosome + '.bed')

aup_and_b = pd.read_table(output_path + 'upstream_intersect_chr' + chromosome + '.bed', header=None)
adown_and_b = pd.read_table(output_path + 'downstream_intersect_chr' + chromosome + '.bed', header=None)
aup_and_b = a_up.intersect(b,wa=True,wb=True).to_dataframe()
adown_and_b = a_down.intersect(b,wa=True,wb=True).to_dataframe()

aup_and_b.columns = ['chr','start','stop','name','index','strand','chr_map','start_map','stop_map','map_up']
adown_and_b.columns = ['chr','start','stop','name','index','strand','chr_map','start_map','stop_map','map_down']
@@ -101,21 +80,17 @@ def compute_mappability_score(values):

r1 = 0
for r in np.array(df.ind):
if result.ix[r,'strand'] == '+':
result.ix[r,'mappability_upstream'] = np.array(df.ix[r1,'map_up']).astype(np.str)
result.ix[r,'mappability_downstream'] = np.array(df.ix[r1,'map_down']).astype(np.str)
if result.ix[r,'strand'] == '-':
result.ix[r,'mappability_upstream'] = np.array(df.ix[r1,'map_down']).astype(np.str)
result.ix[r,'mappability_downstream'] = np.array(df.ix[r1,'map_up']).astype(np.str)
if result.loc[r,'strand'] == '+':
result.loc[r,'mappability_upstream'] = np.array(df.loc[r1,'map_up']).astype(np.str)
result.loc[r,'mappability_downstream'] = np.array(df.loc[r1,'map_down']).astype(np.str)
if result.loc[r,'strand'] == '-':
result.loc[r,'mappability_upstream'] = np.array(df.loc[r1,'map_down']).astype(np.str)
result.loc[r,'mappability_downstream'] = np.array(df.loc[r1,'map_up']).astype(np.str)
r1 += 1

result.score = 1
result.to_csv(path_or_buf=output_path + 'chr' + chromosome + '_map.bed',sep='\t',header=False,index=False)

os.remove(output_path + 'chr' + chromosome + '_upstream.bed')
os.remove(output_path + 'chr' + chromosome + '_downstream.bed')
os.remove(output_path + 'upstream_intersect_chr'+ chromosome +'.bed')
os.remove(output_path + 'downstream_intersect_chr' + chromosome + '.bed')

print "Mappability score for chr" + chromosome + " complete."


@@ -0,0 +1,58 @@
while getopts h:i:b:s:p: option
do
case "${option}"
in
h) hictool=${OPTARG};; # the path to the HiCtool scripts.
i) input_file=${OPTARG};; # Project object file in .hdf5 format.
b) bin_size=${OPTARG};; # Bin size.
s) species=${OPTARG};; # Species.
p) threads=${OPTARG};; # Number of parallel threads to be used.
esac
done

dir=$(dirname "${input_file}")

### Create output directory
checkMakeDirectory(){
echo -e "checking directory: $1"
if [ ! -e "$1" ]; then
echo -e "\tmakedir $1"
mkdir -p "$1"
fi
}

start_time=$(date)

output_dir=$dir"/observed_"$bin_size"/" # output directory
checkMakeDirectory $output_dir

python $hictool"HiCtool_global_map_rows.py" \
-i $input_file \
-o $output_dir \
-b $bin_size \
-s $species \
-c $hictool"chromSizes/" \
-p $threads

while read chromosome size; do
cat $output_dir"matrix_full_line_observed_"$chromosome".txt" >> $output_dir"HiCtool_observed_global_"$bin_size".txt"
done < $hictool"chromSizes/"$species".chrom.sizes"

echo "Start generating global observed matrix: "$start_time
echo "End generating global observed matrix: $(date)"

### Split each matrix line by single contact matrices
echo -n "Splitting global matrix into single tab separated matrices ..."
while read chromosome_row size_row; do
matrix_line=$output_dir"matrix_full_line_observed_"$chromosome_row".txt"
k=0
while read chromosome_col size_col; do
start=`expr $k + 1`
end=`expr $k + $size_col / $bin_size`
cut -d$'\t' -f"$start"-"$end" $matrix_line > $output_dir"chr"$chromosome_row"_chr"$chromosome_col"_"$bin_size".txt"
k=`expr $k + $size_col / $bin_size`
done < $hictool"chromSizes/"$species".chrom.sizes"
done < $hictool"chromSizes/"$species".chrom.sizes"
echo "Done!"

rm $output_dir"matrix_full_line"*

0 comments on commit 6ab9870

Please sign in to comment.
You can’t perform that action at this time.